  1. #1
    Guest

    If you can figure this out, I'll give you a cookie

    Okay, what I'm doing is creating a word search engine. I am OCR'ing TIFF images and then removing all repetitive words from the files generated (one file per page). There are around 60,000 pages total (the files are grouped into folders, one folder per image). I need to find a word, tell which page it was found on, and do it quickly. Right now, I'm opening a file, converting it to a binary array, finding my word(s) in it, recording the results in an array, and then moving to the next file. This takes about 1 min 20 sec for only 4K files. That isn't acceptable, especially since the search is for a website as well as a standard exe.

    I have tried using memo fields in an Access database, but after the database got above 5 MB, crazy things started happening, and _ANY_ SQL queries were coming back with bad results.

    I'm thinking of merging all the OCR'ed pages into one file for each TIFF image, but I'm not sure if there will be a speed advantage to that. (I'd have to spend extra time stepping through the file, keeping track of which 'page' I'm on, grabbing a Mid of the file for every page, searching that string, blah blah blah.) Doing that will bring it down to around 400 files.

    So does anybody have any good ideas for the design of a word search engine? (I don't necessarily need source code, just a description of how it works.)
    (Working on this makes me wonder - how in the hell do Infoseek, Yahoo, and the other web search engines GO SO **** FAST???!!!)

    Ideas appreciated,
    BrewGuru99


  2. #2
    Join Date
    May 1999
    Location
    Oxford UK
    Posts
    1,459

    Re: If you can figure this out, I'll give you a cookie

    When you read in your files, how exactly are you searching for the words ?

    You could read the whole file into a string, then use 'InStr' or 'Like' to find out if the word exists in the file. InStr also gives you the position of the match, from which you could work out what page you're on.
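
    A rough sketch of that position-to-page idea, written in Python rather than VB just to keep it short. The function names (build_combined, find_pages) and the newline separator are made up for illustration; in VB the search itself would be InStr on the combined string. It assumes you record the start offset of each page when you join them, so a match position can be mapped back to a page with a binary search instead of a Mid per page:

```python
from bisect import bisect_right


def build_combined(pages):
    """Join page texts into one lowercase string.

    Returns the combined string and a list holding the start offset
    of each page inside it (hypothetical layout, newline-separated).
    """
    starts = []
    parts = []
    pos = 0
    for text in pages:
        text = text.lower()       # lowercase first so offsets match the stored text
        starts.append(pos)
        parts.append(text)
        pos += len(text) + 1      # +1 for the newline separator
    return "\n".join(parts), starts


def find_pages(combined, starts, word):
    """Return the sorted 1-based page numbers where `word` occurs as a substring."""
    needle = word.lower()
    found = set()
    i = combined.find(needle)
    while i != -1:
        # Number of page starts at or before offset i == 1-based page number.
        found.add(bisect_right(starts, i))
        i = combined.find(needle, i + 1)
    return sorted(found)


combined, starts = build_combined(["alpha beta", "gamma", "beta delta"])
print(find_pages(combined, starts, "beta"))
```

    This still scans all of the text on every query, so it only saves the cost of opening thousands of small files. It is a substring match too, so "cat" would also hit "category".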

    >(Working on this makes me wonder - How in the hell does infoseek, yahoo,
    >and the other web search engines GO SO **** FAST???!!!)

    Like you say - they have a proper word index (I used to work for a library software provider; we indexed every word in titles, subjects, etc. - it gets pretty complicated). On top of that, InfoSeek, Lycos, etc. are probably using Perl/C scripts, which handle strings about 10x faster than VB can.
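
    For what it's worth, the simplest form of a word index is a lookup table from each word to the set of pages it appears on, built once up front so that a query never has to read the page files at all. Here is a minimal sketch in Python (not VB, and certainly not what the big engines or the library software actually ran). The names tokenize, build_index and search are hypothetical, and it assumes the page texts fit in memory as a dictionary of page id to text:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build the index once: map each word to the set of page ids containing it.

    `pages` is assumed to be a dict of page id -> OCR text.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the sorted page ids that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())   # keep only pages that have this word too
    return sorted(result)


pages = {1: "The quick brown fox", 2: "The lazy dog", 3: "Quick dog"}
index = build_index(pages)
print(search(index, "quick dog"))
```

    A lookup is then one dictionary access per query word plus a set intersection, no matter how many pages were indexed. The work moves to build time, and the index has to be stored somewhere and updated whenever pages are added.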

    (can I get half a cookie for this half-answer?)


    Chris Eastwood

    CodeGuru - the website for developers
    http://codeguru.developer.com/vb

  3. #3
    Join Date
    Sep 1999
    Posts
    202

    Re: If you can figure this out, I'll give you a cookie

    Maybe you can use MS SQL Server and its 'Full Text Search' feature.
    http://msdn.microsoft.com/library/sd...8_ar_sa_31.htm
    http://msdn.microsoft.com/library/sd.../8_qd_15_1.htm


  4. #4
    Guest

    Re: If you can figure this out, I'll give you a cookie

    >Like you say - they have a proper word index (I used to work for a Library Software provider, we indexed every word in titles/subjects etc - it gets pretty complicated)

    How does a word index work? Would you please explain this?

