-
GCS-Variation
Hello Experts,
I have a problem, maybe connected to the greatest common substring Problem.
There is a file of about 10-20 Mbytes. Inside this file, you can think of it as a textfile, there are few double parts of about 10-50 kbytes. It looks like
"sometextXXXanothertextXXXlasttextpart", where the XXX-s are identical strings.
How can I find as fast and as reliable as possible the double parts?
My first shot is dividing the whole file in single parts like lines or chapters, deriving hash-values und doing the modified gcs with the shorter file. But this is far away from reliable.
Any Ideas? Thank you!
GMarco
-
Re: GCS-Variation
I think this solution is related to the Basic Local Alignment Search Tool (BLAST) algorithm in computational biology (not really sure; haven't read the details in awhile). (Wikipedia about the BLAST algorithm: http://en.wikipedia.org/wiki/BLAST)
Anyway, the idea is to slide a window of size k (say k=3) across your input and save the location of each k-tuple to a hash table. Every time a hash collision occurs, try to expand the region of similarity between each of the k-tuples. Keep track of the longest one you have found.
Example:
Consider string (w/o spaces; only added for readability):
AAA BBBC DDD BBBC EEE (so BBBC repeats itself)
Window 1: AAA -> add to hash table
Window 2: AAB -> add to hash table
Window 3: ABB -> add to hash table
Window 4: BBB -> add to hash table
.....
Window 11: BBB -> hash collision! Try to expand left and right from window 4 and 11...
...
Until the end
Some implementation details left to you, obviously. (For example, how to choose k optimally and what to do if the greatest repeat length is < k... probably start with some 'good' k and iteratively decrement it).
Hope that helps! Good luck!
-
Re: GCS-Variation
Thank you very much. The BLAST ist about 4.000.000 times faster than everything, wich I had before :-)
The critical part was to find a fast hash function adapted to the sequence problem. I used a polynome over a prime ring with the value of the letters as the coefficients.
Best regards,
GMarco
-
Re: GCS-Variation
Great! Glad it worked out well.