Summary Generator (demonstrates cool algorithm)

Submitted on: 7/9/2000 3:28:24 AM
By: James Vincent Carnicelli  
Level: Advanced
User Rating: By 14 Users
Compatibility: VB 3.0, VB 4.0 (16-bit), VB 4.0 (32-bit), VB 5.0, VB 6.0, VB Script, ASP (Active Server Pages)
Views: 22176
Here's an easy-to-use utility that takes a chunk of plain text and generates a summary of up to the number of words you specify (e.g., 1000).

This code demonstrates an exceedingly cool algorithm I recently read about that basically works like this: Count all the occurrences of each word in the text. Score each sentence based mainly on how many of the most frequent words are in it (with a few other biases, and ignoring dull words like "the"). Pick enough of the highest-scoring sentences to meet the maximum word limit for the summary. Assemble these sentences into a summary.
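
The four steps above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the VB code in the download: the function name `summarize`, the `DULL_WORDS` list, and the choice to score a sentence by summing the counts of its words are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical list of "dull" words; the list used by the original is not shown.
DULL_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def summarize(text, max_words):
    """Return the highest-scoring sentences that fit within max_words."""
    # Naive split: every period is treated as the end of a sentence.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

    # Step 1: count every occurrence of each non-dull word.
    counts = Counter(
        w for words in tokenized for w in words if w not in DULL_WORDS
    )

    # Step 2: score each sentence by the summed frequency of its words.
    scored = []
    for index, words in enumerate(tokenized):
        score = sum(counts[w] for w in words if w not in DULL_WORDS)
        scored.append((score, index))

    # Step 3: take the best sentences that still fit in the word budget.
    chosen = []
    used = 0
    for _score, index in sorted(scored, key=lambda pair: (-pair[0], pair[1])):
        length = len(sentences[index].split())
        if used + length <= max_words:
            chosen.append(index)
            used += length

    # Step 4: reassemble the chosen sentences in their original order.
    return " ".join(sentences[i] + "." for i in sorted(chosen))
```

Summing raw counts favors long sentences, which is one reason a real engine would add further biases on top of this.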

Unbelievably simple, in theory. Not bad in practice, despite the fact that the engine doesn't really "understand" what it's summarizing. All it's doing is picking out the most "representative" sentences.

Included with this demo is an article from Seasoned Cooking magazine (seasoned.com). Try setting the maximum number of words to 100, 200, 300, and so on, and see what you get. You can paste in any text you want. Be sure, though, to help the engine out by putting one or more blank lines between paragraphs, bullet points, etc.

Also, be aware that all periods are assumed to be end-of-sentence markers, even in abbreviations like "i.e.". This and a few other limitations make this algorithm imperfect, but still very illustrative of one kind of linguistic analysis engine.
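
To make the two input assumptions concrete (blank lines separate paragraphs, and every period ends a sentence), here is a small Python sketch. The helper names `split_paragraphs` and `split_sentences` are hypothetical and are not taken from the VB project.

```python
import re


def split_paragraphs(text):
    """Split on one or more blank lines, as the engine expects."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Treat every period as a sentence end, abbreviations included."""
    return [s.strip() for s in paragraph.split(".") if s.strip()]


if __name__ == "__main__":
    # The abbreviation "i.e." is cut into separate fragments.
    print(split_sentences("Use plain text, i.e. no HTML. Then paste it."))
```

Running the example shows the limitation directly: the abbreviation produces the stray fragments "Use plain text, i" and "e" instead of staying inside one sentence.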

I suspect this sort of code is unique on Planet Source Code, so I welcome and encourage your comments. Your vote is also appreciated.

----------------
Recent Updates:
8 November 2000: Engine improvement
- Added a list of "hot words" to bias in favor of sentences with words like "key" and "important"
9 July 2000: Engine improvements
- Ignores words with few characters
- Ignores topmost frequent words
- Ignores lower half of infrequent words
- Better bias towards beginning and end paragraphs
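
The filters and biases listed in these update notes could look roughly like the Python sketch below. Every threshold and multiplier here (`MIN_WORD_LENGTH`, `TOP_WORDS_TO_SKIP`, the 1.5 and 1.25 factors) is an assumed value chosen for illustration, and the function names are hypothetical; the actual values live in the downloadable VB source.

```python
from collections import Counter

MIN_WORD_LENGTH = 4               # assumed cutoff for "few characters"
TOP_WORDS_TO_SKIP = 2             # assumed number of "topmost frequent" words
HOT_WORDS = {"key", "important"}  # the two examples named in the update note


def scoring_vocabulary(words):
    """Return {word: count} for the words that should influence scoring."""
    counts = Counter(w for w in words if len(w) >= MIN_WORD_LENGTH)
    ranked = [w for w, _ in counts.most_common()]  # most frequent first
    ranked = ranked[TOP_WORDS_TO_SKIP:]            # ignore the topmost words
    keep = ranked[: (len(ranked) + 1) // 2]        # drop the infrequent half
    return {w: counts[w] for w in keep}


def sentence_bias(sentence_words, paragraph_index, paragraph_count):
    """Multiplier applied on top of a sentence's frequency score."""
    bias = 1.0
    if any(w in HOT_WORDS for w in sentence_words):
        bias *= 1.5   # assumed bonus for "hot words"
    if paragraph_index in (0, paragraph_count - 1):
        bias *= 1.25  # assumed bonus for first and last paragraphs
    return bias
```

A sentence's final score would then be its frequency score, computed over `scoring_vocabulary`, multiplied by `sentence_bias`.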


 

Note: Due to the size or complexity of this submission, the author has submitted it as a .zip file to shorten your download time. After downloading it, you will need a program like Winzip to decompress it.

Virus note: All files are scanned once-a-day by Planet Source Code for viruses, but new viruses come out every day, so no prevention program can catch 100% of them. For your own safety, please:
  1. Re-scan downloaded files using your personal virus checker before using it.
  2. NEVER, EVER run compiled files (.exe's, .ocx's, .dll's etc.)--only run source code.
  3. Scan the source code with Minnow's Project Scanner

If you don't have a virus scanner, you can get one at many places on the net including: McAfee.com

 
Terms of Agreement:   
By using this code, you agree to the following terms...   
  1. You may use this code in your own programs (and may compile it into a program and distribute it in compiled format for languages that allow it) freely and with no charge.
  2. You MAY NOT redistribute this code (for example to a web site) without written permission from the original author. Failure to do so is a violation of copyright laws.   
  3. You may link to this code from another website, but ONLY if it is not wrapped in a frame. 
  4. You will abide by any additional copyright restrictions which the author may have placed in the code or code's description.


Other User Comments

7/9/2000 4:03:22 AM Ulli

Add two more words to the WordsToIgnoreList: (for & on) and see what happens. I don't usually believe in miracles but this makes me think...


 
7/9/2000 4:22:52 AM Ulli

PS
Forgot to say: try the minimum size summary then - 43 words

 
7/9/2000 8:10:51 AM Ulli

Jim - I've been experimenting a bit: try ignoring short words < 5 chars and short sentences < 10 words. That'll get rid of the ShouldIgnore words and exclude short sentences, which do not convey much information.

 
7/9/2000 10:24:05 AM James Vincent Carnicelli

Ulli: I've followed your suggestions, except for the short sentence one. I've also added a number of improvements. The algorithm that inspired this one assumes that the most frequent words are actually not important. It assumes a bell curve of "importance" for words where the peak is somewhere around, say, the 30% most frequently-occurring word. I implemented this notion. Works better, now.
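
The bell-curve notion described here can be sketched in Python as a weight per word based on its frequency rank. This is a guess at the shape only: the Gaussian form, the `spread` parameter, and the function name `rank_weights` are assumptions, with `peak=0.30` standing in for the "30%" figure mentioned above.

```python
import math
from collections import Counter


def rank_weights(words, peak=0.30, spread=0.20):
    """Weight each word by how close its frequency rank is to the peak.

    Rank position 0.0 is the most frequent word and 1.0 the least frequent.
    The Gaussian shape and the spread value are assumptions.
    """
    ranked = [w for w, _ in Counter(words).most_common()]
    last = max(len(ranked) - 1, 1)  # avoid dividing by zero for one word
    weights = {}
    for rank, word in enumerate(ranked):
        position = rank / last
        weights[word] = math.exp(-((position - peak) ** 2) / (2 * spread ** 2))
    return weights
```

With weights like these, the very commonest words and the rare words both contribute little, and words about a third of the way down the frequency ranking contribute the most.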

 
7/9/2000 1:01:11 PM dEmOnIc

Nicely done.

 
7/10/2000 2:54:00 AM ARRiVE

I have problems with Html files. Otherwise, Great Job!

 
7/10/2000 12:22:30 PM Chance

i cant get it to load, what am i doing wrong ?

 
7/12/2000 1:44:25 PM Detonate

AAAARRRGHHHH!!! Why did you not release this code a few years ago before I graduated school!?!? do you have no concern for my grades? *grin*
excellent job :-)


 
7/12/2000 3:09:06 PM James Vincent Carnicelli

The obvious answer is: it's just a part of a vast conspiracy against your general well-being. :-)

 
8/15/2001 3:47:57 PM Rhett Micheletti

James,

A very different piece of work, and it's actually quite useful for certain purposes.

Thank you for sharing your hard work,
Rhett

 
7/17/2004 11:38:11 AM Mike Bironneau

In sub SortWordsByOccurance(), an error gets returned if there are fewer than x words to summarize.
Apart from that, great code.

 
2/8/2005 1:23:40 PM GrayMagiker

I like it. Thank you very much for sharing this with everyone. Five Stars from me, it is a shame I can only vote once for this great code.

 
2/9/2005 12:26:22 PM James Vincent Carnicelli

Someone asked me "What is the name of the 'exceedingly cool algorithm?' I would like to read more about it." I responded, in part:

The book is called “Advances in Automatic Text Summarization”, and can be found at:

- http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=3943

- http://www.amazon.com/exec/obidos/tg/detail/-/0262133598/104-2456268-6559133?v=glance

Continued...

 
2/9/2005 12:26:56 PM James Vincent Carnicelli

I had found the book in a local Barnes and Noble. A few years later, I did actually buy a copy of the book to aid me on a much more sophisticated project that involved text analysis, but gave it to one of my developers. If you want to learn more about the subject, this book is pretty good. It’s a collection of articles that the editors thought were representative of the field’s progress over the prior several decades. As such, it’s not very linear, and the reader is prompted to read many other documents and books cited. And most of the articles included assume you are immersed in the subject, so it’s pretty jargon-heavy.

Continued ...

 
2/9/2005 12:27:20 PM James Vincent Carnicelli

Here’s my take on the book and the subject. The very first article cited, written sometime back in the 50s or 60s, really sets the stage for all the others. And if your goal is to do something simplistic like I did and Microsoft (MS Word) and others have, that first article is probably good enough. Many of the others focus on sharpening the basic algorithm you see in my code by careful statistical surveys. There is a later section that introduces the concept of “corpus-based summarization”, which is a fancy way of saying, “adding knowledge to a system about one or more particular domains of experience.” If you want a general-purpose summarizer, this is mainly a cool distraction.

(End of message)

 
2/5/2007 10:28:09 PM dennis

will it work on a .NET platform? coz when i try to convert it into VB .NET, i was getting a lots of probs, haiz, maybe im just not good in programming. James, is there anyway to make it work? Thanx for ur hard work, it really helps a lot especially for a news reporter like me.

 
3/16/2012 9:45:06 PM Alex

How exactly do I use this? When I unzip it there's no application file to open it with. Please help!