Click to See Complete Forum and Search --> : Advanced String Comparison


Bobby_1234
January 3rd, 2009, 07:00 PM
Hi,

I am writing a program that stores lists of a type of object called 'Fixture'. Each Fixture object has members string team1, string team 2, to identify it. However I'm running into problems when trying to search through these lists for specific fixtures, e.g. a Fixture might have team1="A Villa", team2="West Brom", or another list may have team1="Aston Villa", team2="W Brom".

I need my code to recognise that these two things are actually the same fixture. The standard functions like '.Contains()' don't really do it. Anyone got any ideas? I was using some algorithm called the Levenshtein distance to gauge the similarity of two strings, and it's ok, but it also matches things that aren't the same sometimes, which is a pain.

Cheers

BigEd781
January 3rd, 2009, 07:24 PM
Is it not possible to standardize the data beforehand?

Bobby_1234
January 3rd, 2009, 07:50 PM
Not really, since I'm scraping these fixtures from various websites so the format it comes in is set by them. If it was just a matter of a few teams then obviously I could write code to account for the fact A Villa equals Aston Villa etc, but there are actually quite a lot of teams that can be written in similar but not identical ways.

BigEd781
January 3rd, 2009, 08:00 PM
Hmmm, I don't know what to say then. Unless you can account for all of the possible formats, you will have to use a method like the one you are already using. Maybe someone else around here has run into a similar problem.

TheCPUWizard
January 3rd, 2009, 08:50 PM
One alternative is to put the items in a collection rahter than independant fields. Then just5 implement an IComparer that sorts the items before comparing. Therefore things with the same set (regardless of order) will be identified as equal.

Bobby_1234
January 3rd, 2009, 08:56 PM
It's not so much the order that is the problem (although that technique is useful to me for something else, so thanks :) ). It's the fact that different websites have Aston Villa, A Villa...etc so it's just the comparison of single strings really. Just wondering what the best algorithm is to do this..

TheCPUWizard
January 3rd, 2009, 09:04 PM
This is a very difficult situation when converting "raw" data into "structured" data. [And I have done this for some massive programs]

Consider that in some cases a single letter difference may have a complete different meaning, while items which differ greatly (looked at as a character sequence) mean the same thing.

Generally (but by no means in all cases), a "dictionary" of "standard" terminology along with "synonmyms" is the best way to go. Then when you scrape the sites, you keep the original data for accuracy, but always use sanitized data for processing.

Doubling the storage requirements usually, provides better results than doing the sanitization on the fly (consider indexed operations.....)