October 5th, 1999, 04:19 PM
Okay, what I'm doing is creating a word search engine. I am OCR'ing TIFF images and then removing all repetitive words from the files generated (one file per page). There are around 60,000 pages total (the files are grouped into folders, one folder per image). I need to find a word, tell what page it was found on, and do it quickly. Right now, I'm opening a file, converting it to a binary array, finding my word(s) in it, recording the results in an array, then moving to the next file. This takes about 1 min 20 sec for only 4K files. That isn't acceptable, especially since the search is for a website as well as a standard exe.
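To make the current approach concrete, here is a rough Python sketch of the file-by-file scan described above. It is only an illustration, not my actual code: the function name `scan_pages`, the folder layout, and the case-insensitive substring match are all assumptions made for the example.

```python
import os

def scan_pages(root, word):
    """Return sorted (folder, page_file) pairs whose text contains `word`.

    Hypothetical layout: root/<image folder>/<one text file per page>.
    """
    needle = word.lower().encode("ascii")
    hits = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            # Read the whole page as bytes, like the "binary array" step.
            with open(os.path.join(folder, name), "rb") as fh:
                data = fh.read().lower()
            # Plain substring test, so "brew" also matches "brewery".
            if needle in data:
                hits.append((os.path.relpath(folder, root), name))
    return sorted(hits)
```

Every search opens and reads every page file, so the cost grows with the total number of files no matter how rare the word is.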
I have tried using memo fields in an Access database, but after the database got above 5 megs, crazy things started happening, and _ANY_ SQL queries were coming up with bad results.
I'm thinking of parsing all the OCR'ed pages into one file for each TIFF image, but I'm not sure if there will be a speed advantage to that. (I'd have to spend extra time stepping through the file, keeping track of which 'page' I'm on, grabbing a Mid of the file for every page, searching that string, blah blah blah.) Doing that would bring it down to around 400 files.
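A minimal Python sketch of that one-file-per-image idea, just to show the bookkeeping involved. The form-feed page delimiter `PAGE_SEP` and the helper names `combine_pages` and `pages_containing` are made up for the example; any marker that cannot appear in the OCR text would do.

```python
PAGE_SEP = "\f"  # hypothetical page delimiter (form feed)

def combine_pages(pages):
    """Join the per-page texts of one image into a single string."""
    return PAGE_SEP.join(pages)

def pages_containing(combined, word):
    """Return the 1-based page numbers whose text contains `word`."""
    needle = word.lower()
    return [
        number
        for number, page in enumerate(combined.split(PAGE_SEP), start=1)
        if needle in page.lower()
    ]
```

For example, `pages_containing(combine_pages(["alpha beta", "gamma", "beta delta"]), "beta")` gives `[1, 3]`. This cuts the number of file opens, but each search still reads all of the text.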
So does anybody have any good ideas for the design of a word search engine? (I don't need source code necessarily, just a description of how it works.)
(Working on this makes me wonder - How in the hell do Infoseek, Yahoo, and the other web search engines GO SO **** FAST???!!!)
Ideas appreciated,
BrewGuru99