|
-
June 11th, 2011, 02:26 AM
#1
GCS-Variation
Hello Experts,
I have a problem, maybe connected to the greatest common substring Problem.
There is a file of about 10-20 Mbytes. Inside this file, you can think of it as a textfile, there are few double parts of about 10-50 kbytes. It looks like
"sometextXXXanothertextXXXlasttextpart", where the XXX-s are identical strings.
How can I find as fast and as reliable as possible the double parts?
My first shot is dividing the whole file in single parts like lines or chapters, deriving hash-values und doing the modified gcs with the shorter file. But this is far away from reliable.
Any Ideas? Thank you!
GMarco
-
June 16th, 2011, 01:33 AM
#2
Re: GCS-Variation
I think this solution is related to the Basic Local Alignment Search Tool (BLAST) algorithm in computational biology (not really sure; haven't read the details in awhile). (Wikipedia about the BLAST algorithm: http://en.wikipedia.org/wiki/BLAST)
Anyway, the idea is to slide a window of size k (say k=3) across your input and save the location of each k-tuple to a hash table. Every time a hash collision occurs, try to expand the region of similarity between each of the k-tuples. Keep track of the longest one you have found.
Example:
Consider string (w/o spaces; only added for readability):
AAA BBBC DDD BBBC EEE (so BBBC repeats itself)
Window 1: AAA -> add to hash table
Window 2: AAB -> add to hash table
Window 3: ABB -> add to hash table
Window 4: BBB -> add to hash table
.....
Window 11: BBB -> hash collision! Try to expand left and right from window 4 and 11...
...
Until the end
Some implementation details left to you, obviously. (For example, how to choose k optimally and what to do if the greatest repeat length is < k... probably start with some 'good' k and iteratively decrement it).
Hope that helps! Good luck!
Last edited by BioPhysEngr; June 16th, 2011 at 01:34 AM.
Reason: add encouragement
Best Regards,
BioPhysEngr
http://blog.biophysengr.net
--
All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.
-
June 30th, 2011, 02:14 AM
#3
Re: GCS-Variation
Thank you very much. The BLAST ist about 4.000.000 times faster than everything, wich I had before :-)
The critical part was to find a fast hash function adapted to the sequence problem. I used a polynome over a prime ring with the value of the letters as the coefficients.
Best regards,
GMarco
-
June 30th, 2011, 02:53 AM
#4
Re: GCS-Variation
Great! Glad it worked out well.
Best Regards,
BioPhysEngr
http://blog.biophysengr.net
--
All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|