Hi
I’m struggling to find near-duplicates in large texts. Greedy String Tiling (GST) works pretty well, but it is very slow. SimHash is fast and reliable, but not only has it been patented by Google, there is also no C# implementation of it.
Please note that likeness of two strings is different from near-duplication. For example:
“This is a book” is similar to “Book a is This” because all the words of the first sentence exist in the second string. However, for SimHash and GST the degree of duplication is 0, since they work on shingles of at least 2 words.
One quick way of examining the likeness of two strings is to shingle the strings into n-gram pieces (shingles), convert each shingle to a hash value, and use the Jaccard index to calculate the percentage of likeness:
http://en.wikipedia.org/wiki/Jaccard_coefficient
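For completeness, here is a minimal C# sketch of that shingle-hash-Jaccard idea (my own rough code, not the code from the article; string.GetHashCode() is only stable within a single process, so a stable hash such as MD5 should be substituted if the values are stored):

using System;
using System.Collections.Generic;
using System.Linq;

static class ShingleJaccard
{
    // Split the text into word-level n-gram shingles and hash each one,
    // so all later comparisons work on integers instead of strings.
    public static HashSet<int> HashedShingles(string text, int n)
    {
        string[] words = text.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var hashes = new HashSet<int>();
        for (int i = 0; i + n <= words.Length; i++)
            hashes.Add(string.Join(" ", words.Skip(i).Take(n)).GetHashCode());
        return hashes;
    }

    // Jaccard index: |intersection| / |union| of the two hash sets.
    public static double Jaccard(HashSet<int> a, HashSet<int> b)
    {
        if (a.Count == 0 && b.Count == 0) return 1.0;
        int shared = a.Count(h => b.Contains(h));
        return (double)shared / (a.Count + b.Count - shared);
    }

    static void Main()
    {
        var s1 = HashedShingles("This is a book", 2);
        var s2 = HashedShingles("Book a is This", 2);
        // Prints 0: the sentences share words but no 2-word shingles.
        Console.WriteLine(Jaccard(s1, s2));
    }
}

With 1-word shingles the same code would report a likeness of 1.0 for these two sentences, which is exactly the likeness-vs-near-duplication distinction above.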
This link explains sample code for this approach:
http://raza-rizvi.blogspot.com/2009/...g-md5-and.html
Now my question is:
The approach mentioned in the link above only works on fixed-size n-gram shingles, for example 2-word shingles (e.g. “this is”, “is a”, “a book”). But I reckon that bigger shingles (e.g. 4-word shingles) must carry higher weights. For instance, if “this is” is found in “This is a book”, we should give it a weight of 10; if “This is a” is found, a weight of 20 should be given. How can we do this weighting along with hashing? I don’t want to compare or store strings, as that is very slow; I need to use hash values. A sketch of what I have in mind follows below.
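To make the question concrete, here is a sketch of the kind of thing I have in mind, still using only hash values: shingle at several lengths at once, map each shingle’s hash to a weight that grows with the shingle length, and compute a weighted Jaccard index (shared weight divided by total weight). The weight table (2-gram = 10, 3-gram = 20, 4-gram = 40) is an arbitrary placeholder of my own, not something from the article:

using System;
using System.Collections.Generic;
using System.Linq;

static class WeightedShingleSimilarity
{
    // Placeholder weights per shingle length: longer matches count more.
    static readonly Dictionary<int, int> WeightByLength =
        new Dictionary<int, int> { { 2, 10 }, { 3, 20 }, { 4, 40 } };

    // Map each hashed shingle to the weight of its length, for all
    // lengths from 2 up to maxN. Only hashes are kept, never strings.
    public static Dictionary<int, int> WeightedHashes(string text, int maxN)
    {
        string[] words = text.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var map = new Dictionary<int, int>();
        for (int n = 2; n <= maxN; n++)
            for (int i = 0; i + n <= words.Length; i++)
                map[string.Join(" ", words.Skip(i).Take(n)).GetHashCode()]
                    = WeightByLength[n];
        return map;
    }

    // Weighted Jaccard: shared weight over total weight, so a shared
    // 4-word shingle contributes four times as much as a 2-word one.
    public static double WeightedJaccard(Dictionary<int, int> a,
                                         Dictionary<int, int> b)
    {
        long shared = 0, total = 0;
        foreach (var kv in a)
        {
            if (b.ContainsKey(kv.Key)) shared += kv.Value;
            total += kv.Value;
        }
        foreach (var kv in b)
            if (!a.ContainsKey(kv.Key)) total += kv.Value;
        return total == 0 ? 0.0 : (double)shared / total;
    }

    static void Main()
    {
        var a = WeightedHashes("this is a book about hashing", 4);
        var b = WeightedHashes("this is a book on hashing", 4);
        Console.WriteLine(WeightedJaccard(a, b));
    }
}

Is something like this a sensible way to combine weights with hashing, or is there a standard scheme for it?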