Hi
I’m struggling to find near-duplicates in large texts. Greedy String Tiling (GST) works pretty well, but it is very slow. SimHash is fast and reliable, but not only has it been patented by Google, there is also no C# implementation of it.
Please note that likeness of two strings is different from near-duplication. For example:
“This is a book” is similar to “Book a is This” because all the words of the first sentence exist in the second string. However, for SimHash and GST the degree of duplication is 0, since they work on shingles of at least 2 words.
One quick way of examining the likeness of two strings is to shingle the strings into n-gram pieces (shingles), convert each shingle to a hash value, and use the Jaccard index to calculate the percentage of likeness:
http://en.wikipedia.org/wiki/Jaccard_coefficient
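For completeness, here is a minimal C# sketch of that shingle-hash-Jaccard idea (my own rough code, not the code from the article; string.GetHashCode() is only stable within a single process, so a stable hash such as MD5 should be substituted if the values are stored):

using System;
using System.Collections.Generic;
using System.Linq;

static class ShingleJaccard
{
    // Split the text into word-level n-gram shingles and hash each one,
    // so all later comparisons work on integers instead of strings.
    public static HashSet<int> HashedShingles(string text, int n)
    {
        string[] words = text.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var hashes = new HashSet<int>();
        for (int i = 0; i + n <= words.Length; i++)
            hashes.Add(string.Join(" ", words.Skip(i).Take(n)).GetHashCode());
        return hashes;
    }

    // Jaccard index: |intersection| / |union| of the two hash sets.
    public static double Jaccard(HashSet<int> a, HashSet<int> b)
    {
        if (a.Count == 0 && b.Count == 0) return 1.0;
        int shared = a.Count(h => b.Contains(h));
        return (double)shared / (a.Count + b.Count - shared);
    }

    static void Main()
    {
        var s1 = HashedShingles("This is a book", 2);
        var s2 = HashedShingles("Book a is This", 2);
        // Prints 0: the sentences share words but no 2-word shingles.
        Console.WriteLine(Jaccard(s1, s2));
    }
}

With 1-word shingles the same code would report a likeness of 1.0 for these two sentences, which is exactly the likeness-vs-near-duplication distinction above.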
This link explains sample code for this approach:
http://raza-rizvi.blogspot.com/2009/...g-md5-and.html
Now my question is:
The approach mentioned in the link above only works on fixed-size n-gram shingles, for example 2-word shingles (e.g. “this is”, “is a”, “a book”). But I reckon that bigger shingles (e.g. 4-word shingles) must carry higher weights. For instance, if “this is” is found in “This is a book”, we should give it a weight of 10; if “This is a” is found, a weight of 20 should be given. How can we do this weighting along with hashing? I don’t want to compare or store strings, as that is very slow; I need to use hash values. A sketch of what I have in mind follows below.
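To make the question concrete, here is a sketch of the kind of thing I have in mind, still using only hash values: shingle at several lengths at once, map each shingle’s hash to a weight that grows with the shingle length, and compute a weighted Jaccard index (shared weight divided by total weight). The weight table (2-gram = 10, 3-gram = 20, 4-gram = 40) is an arbitrary placeholder of my own, not something from the article:

using System;
using System.Collections.Generic;
using System.Linq;

static class WeightedShingleSimilarity
{
    // Placeholder weights per shingle length: longer matches count more.
    static readonly Dictionary<int, int> WeightByLength =
        new Dictionary<int, int> { { 2, 10 }, { 3, 20 }, { 4, 40 } };

    // Map each hashed shingle to the weight of its length, for all
    // lengths from 2 up to maxN. Only hashes are kept, never strings.
    public static Dictionary<int, int> WeightedHashes(string text, int maxN)
    {
        string[] words = text.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var map = new Dictionary<int, int>();
        for (int n = 2; n <= maxN; n++)
            for (int i = 0; i + n <= words.Length; i++)
                map[string.Join(" ", words.Skip(i).Take(n)).GetHashCode()]
                    = WeightByLength[n];
        return map;
    }

    // Weighted Jaccard: shared weight over total weight, so a shared
    // 4-word shingle contributes four times as much as a 2-word one.
    public static double WeightedJaccard(Dictionary<int, int> a,
                                         Dictionary<int, int> b)
    {
        long shared = 0, total = 0;
        foreach (var kv in a)
        {
            if (b.ContainsKey(kv.Key)) shared += kv.Value;
            total += kv.Value;
        }
        foreach (var kv in b)
            if (!a.ContainsKey(kv.Key)) total += kv.Value;
        return total == 0 ? 0.0 : (double)shared / total;
    }

    static void Main()
    {
        var a = WeightedHashes("this is a book about hashing", 4);
        var b = WeightedHashes("this is a book on hashing", 4);
        Console.WriteLine(WeightedJaccard(a, b));
    }
}

Is something like this a sensible way to combine weights with hashing, or is there a standard scheme for it?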