If you can figure this out, I'll give you a cookie


October 5th, 1999, 04:19 PM
Okay, what I'm doing is creating a word search engine. I am OCR'ing TIFF images and then removing all repeated words from the files generated (one file per page). There are around 60,000 pages total (the files are grouped into folders, one for each image). I need to find a word, tell what page it was found on, and do it quickly. Right now, I'm opening a file, converting it to a binary array, finding my word(s) in it, recording the results in an array, and then moving on to the next file. This takes about 1 min 20 sec for only 4K files. That isn't acceptable, especially since the search is for a website as well as a standard exe.

I have tried using Memo fields in an Access database, but after the database got above 5 MB, crazy things started happening, and _ANY_ SQL queries were coming up with bad results.

I'm thinking of parsing all the OCR'ed pages into one file for each TIFF image, but I'm not sure whether there will be a speed advantage to that. (I'd have to spend extra time stepping through the file, keeping track of which 'page' I'm on, grabbing a Mid of the file for every page, searching that string, blah blah blah.) Doing that would bring it down to around 400 files.

So does anybody have any good ideas for the design of a word search engine? (I don't necessarily need source code, just a description of how it works.)
(Working on this makes me wonder - How in the hell do Infoseek, Yahoo, and the other web search engines GO SO **** FAST???!!!)

Ideas appreciated,
BrewGuru99

Chris Eastwood
October 5th, 1999, 04:45 PM
When you read in your files, how exactly are you searching for the words ?

You could read the whole file into a string, then use 'InStr' or 'Like' to find out whether the word exists in the file. Using InStr should give you a position from which you could work out what page you're on.
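
To make the "position to page" idea concrete, here is a rough sketch. It is in Python rather than VB purely for illustration, and the names build_document and find_pages are made up for this example. It assumes the pages of one image are joined into a single string with a newline between them, and that the start offset of each page is recorded while joining. str.find plays the role of InStr here (note it is 0-based, where InStr is 1-based).

```python
from bisect import bisect_right

def build_document(pages):
    """Join page texts into one string and remember where each page starts."""
    starts = []
    offset = 0
    for text in pages:
        starts.append(offset)
        offset += len(text) + 1  # +1 for the newline separator added below
    return "\n".join(pages), starts

def find_pages(document, starts, word):
    """Return the 1-based page numbers on which `word` occurs (substring match)."""
    found = []
    pos = document.find(word)
    while pos != -1:
        # The count of page starts <= pos is the 1-based page number.
        page = bisect_right(starts, pos)
        if not found or found[-1] != page:
            found.append(page)
        pos = document.find(word, pos + 1)
    return found

document, starts = build_document(["alpha beta", "gamma", "beta delta"])
pages_with_beta = find_pages(document, starts, "beta")
```

This still scans the whole text on every search, so it only cuts down on file opens; it does not remove the linear scan itself.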

>(Working on this makes me wonder - How in the hell does infoseek, yahoo,
>and the other web search engines GO SO **** FAST???!!!)

Like you say - they have a proper word index (I used to work for a library software provider; we indexed every word in titles, subjects, etc., and it gets pretty complicated). On top of that, InfoSeek, Lycos, etc. are probably using Perl/C scripts, which handle strings about 10x faster than VB can.
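
As a very rough idea of what a word index looks like (a minimal sketch only, in Python rather than VB, and certainly not how the big engines actually do it): the index is built once, mapping each word to the set of pages it appears on, so a search becomes a lookup instead of a scan through every file. The function names build_index and search and the simple letters-and-digits tokenizer are assumptions made for this example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase words (letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of page numbers it appears on.

    `pages` is a dict of page number -> OCR'ed text for that page.
    """
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_no)
    return index

def search(index, query):
    """Return the sorted page numbers that contain every word in `query`."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], ()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)

pages = {1: "The quick brown fox", 2: "A lazy brown dog", 3: "Quick dog"}
index = build_index(pages)
hits = search(index, "quick dog")
```

The slow part (reading all 60,000 pages) happens once at indexing time; in practice the index would be saved to disk or a database table rather than rebuilt for each search.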

(can I get half a cookie for this half-answer?)


Chris Eastwood

CodeGuru - the website for developers
http://codeguru.developer.com/vb

Bruno
October 5th, 1999, 06:09 PM
Maybe you can use MS SQL Server and its 'Full Text Search' feature.
http://msdn.microsoft.com/library/sdkdoc/sql/8_ar_sa_31.htm
http://msdn.microsoft.com/library/sdkdoc/sql/8_qd_15_1.htm

October 5th, 1999, 06:23 PM
>Like you say - they have a proper word index (I used to work for a Library Software provider, we indexed every word in titles/subjects etc - it gets pretty complicated)

How does a word index work? Would you please explain this?