  1. #1
    Join Date
    Dec 2008
    Posts
    31

    Advanced String Comparison

    Hi,

    I am writing a program that stores lists of a type of object called 'Fixture'. Each Fixture object has two string members, team1 and team2, to identify it. However, I'm running into problems when searching these lists for specific fixtures: one list might have a Fixture with team1="A Villa", team2="West Brom", while another list has the same match as team1="Aston Villa", team2="W Brom".

    I need my code to recognise that these two things are actually the same fixture. The standard functions like '.Contains()' don't really do it. Anyone got any ideas? I was using some algorithm called the Levenshtein distance to gauge the similarity of two strings, and it's ok, but it also matches things that aren't the same sometimes, which is a pain.
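For anyone following along, here is a minimal sketch of the Levenshtein approach described above. The class name `FuzzyMatch` and the normalised `Similarity` score are my own assumptions for illustration, not the original poster's code.

```csharp
using System;

public static class FuzzyMatch
{
    // Classic dynamic-programming Levenshtein distance:
    // each insert, delete, or substitute costs 1.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Hypothetical helper: similarity in [0, 1], where 1.0 means identical.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / longest;
    }
}
```

This also shows why the false matches happen. "A Villa" and "Aston Villa" are 4 edits apart (similarity about 0.64), while a made-up pair of different teams such as "Man City" and "Man Utd" is only 3 edits apart (about 0.63), so no single threshold separates the two cases.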

    Cheers

  2. #2
    Join Date
    Jun 2008
    Posts
    2,477

    Re: Advanced String Comparison

    Is it not possible to standardize the data beforehand?

  3. #3
    Join Date
    Dec 2008
    Posts
    31

    Re: Advanced String Comparison

    Not really. I'm scraping these fixtures from various websites, so the format they come in is set by those sites. If it was just a matter of a few teams then obviously I could write code to account for the fact that "A Villa" equals "Aston Villa" etc, but there are actually quite a lot of teams that can be written in similar but not identical ways.

  4. #4
    Join Date
    Jun 2008
    Posts
    2,477

    Re: Advanced String Comparison

    Hmmm, I don't know what to say then. Unless you can account for all of the possible formats, you will have to use a method like the one you are already using. Maybe someone else around here has run into a similar problem.

  5. #5
    Join Date
    Mar 2002
    Location
    St. Petersburg, Florida, USA
    Posts
    12,125

    Re: Advanced String Comparison

    One alternative is to put the items in a collection rather than independent fields. Then just implement an IComparer that sorts the items before comparing. That way, fixtures containing the same set of teams (regardless of order) will be identified as equal.
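A rough sketch of that idea, assuming the teams are held as a collection of strings. The class name `UnorderedTeamsComparer` and the choice of case-insensitive ordinal comparison are assumptions of this example, not something stated in the post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Compares two team collections after sorting them, so the order in which
// the teams were listed does not affect the result.
public sealed class UnorderedTeamsComparer : IComparer<IEnumerable<string>>
{
    public int Compare(IEnumerable<string> x, IEnumerable<string> y)
    {
        if (ReferenceEquals(x, y)) return 0;
        if (x == null) return -1;
        if (y == null) return 1;

        var a = x.OrderBy(t => t, StringComparer.OrdinalIgnoreCase).ToList();
        var b = y.OrderBy(t => t, StringComparer.OrdinalIgnoreCase).ToList();

        // Compare element by element once both sides are in the same order.
        for (int i = 0; i < Math.Min(a.Count, b.Count); i++)
        {
            int c = StringComparer.OrdinalIgnoreCase.Compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return a.Count.CompareTo(b.Count);
    }
}
```

Note that this only handles ordering: { "Aston Villa", "West Brom" } and { "West Brom", "Aston Villa" } compare as equal, but "A Villa" and "Aston Villa" are still treated as different strings.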
    TheCPUWizard is a registered trademark, all rights reserved. (If this post was helpful, please RATE it!)
    2008, 2009,2010
    In theory, there is no difference between theory and practice; in practice there is.

    * Join the fight, refuse to respond to posts that contain code outside of [code] ... [/code] tags. See here for instructions
    * How NOT to post a question here
    * Of course you read this carefully before you posted
    * Need homework help? Read this first

  6. #6
    Join Date
    Dec 2008
    Posts
    31

    Re: Advanced String Comparison

    It's not so much the order that is the problem (although that technique is useful to me for something else, so thanks). It's the fact that different websites have "Aston Villa", "A Villa", etc., so it really comes down to comparing single strings. I'm just wondering what the best algorithm for this is.

  7. #7
    Join Date
    Mar 2002
    Location
    St. Petersburg, Florida, USA
    Posts
    12,125

    Re: Advanced String Comparison

    This is a very difficult situation when converting "raw" data into "structured" data. [And I have done this for some massive programs]

    Consider that in some cases a single-letter difference may have a completely different meaning, while items which differ greatly (looked at as a character sequence) mean the same thing.

    Generally (but by no means in all cases), a "dictionary" of "standard" terminology along with "synonyms" is the best way to go. Then when you scrape the sites, you keep the original data for accuracy, but always use sanitized data for processing.
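As a minimal sketch of such a dictionary: every known spelling maps to one standard name, and anything unknown is reported so it can be reviewed by hand. The class and method names here (`TeamNameDictionary`, `Add`, `TryStandardize`) are hypothetical, and the case-insensitive lookup is an assumption.

```csharp
using System;
using System.Collections.Generic;

public sealed class TeamNameDictionary
{
    // Maps every known spelling (case-insensitively) to one standard name.
    private readonly Dictionary<string, string> _aliases =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    // Registers a standard name together with its synonyms.
    public void Add(string standardName, params string[] synonyms)
    {
        _aliases[standardName] = standardName;
        foreach (var s in synonyms) _aliases[s] = standardName;
    }

    // Returns true and the standard name if the scraped spelling is known;
    // otherwise false, so the caller can queue it for manual review.
    public bool TryStandardize(string scraped, out string standardName)
    {
        return _aliases.TryGetValue(scraped.Trim(), out standardName);
    }
}
```

For example, after `names.Add("Aston Villa", "A Villa")` and `names.Add("West Brom", "W Brom")`, both "A Villa" and "Aston Villa" standardize to "Aston Villa". The synonym list has to be maintained by hand (or seeded from a fuzzy matcher and then checked), which is the price of avoiding false matches.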

    Doubling the storage requirements by keeping both copies usually provides better results than doing the sanitization on the fly (consider indexed operations...).
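To illustrate the indexing point, here is a sketch that stores both the raw and the sanitized names and indexes fixtures by a key built from the sanitized pair. The types `StoredFixture` and `FixtureIndex` are hypothetical, and building an order-insensitive key is an assumption; keep the order in the key if home and away matter to you.

```csharp
using System;
using System.Collections.Generic;

public sealed class StoredFixture
{
    public string RawTeam1 { get; }   // exactly as scraped
    public string RawTeam2 { get; }
    public string Team1 { get; }      // sanitized (standard) names
    public string Team2 { get; }
    public string Key { get; }        // computed once, at storage time

    public StoredFixture(string rawTeam1, string rawTeam2, string team1, string team2)
    {
        RawTeam1 = rawTeam1; RawTeam2 = rawTeam2;
        Team1 = team1; Team2 = team2;
        // Put the two sanitized names in a fixed order so the key ignores ordering.
        Key = string.CompareOrdinal(team1, team2) <= 0
            ? team1 + "|" + team2
            : team2 + "|" + team1;
    }
}

public sealed class FixtureIndex
{
    private readonly Dictionary<string, List<StoredFixture>> _byKey =
        new Dictionary<string, List<StoredFixture>>(StringComparer.Ordinal);

    public void Add(StoredFixture f)
    {
        if (!_byKey.TryGetValue(f.Key, out var list))
        {
            list = new List<StoredFixture>();
            _byKey[f.Key] = list;
        }
        list.Add(f);
    }

    // Looks up by sanitized names with a single hash lookup, no string fuzzing.
    public IReadOnlyList<StoredFixture> Find(string team1, string team2)
    {
        var probe = new StoredFixture(team1, team2, team1, team2);
        return _byKey.TryGetValue(probe.Key, out var list)
            ? list
            : new List<StoredFixture>();
    }
}
```

Because the sanitized names are stored rather than recomputed, each search is one dictionary lookup, and the raw strings are still there if a mapping later turns out to be wrong.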
