CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Results 1 to 4 of 4

Thread: GCS-Variation

  1. #1
    Join Date
    May 2011
    Posts
    22

    GCS-Variation

    Hello Experts,

    I have a problem, maybe connected to the greatest common substring Problem.

    There is a file of about 10-20 Mbytes. Inside this file, you can think of it as a textfile, there are few double parts of about 10-50 kbytes. It looks like
    "sometextXXXanothertextXXXlasttextpart", where the XXX-s are identical strings.

    How can I find as fast and as reliable as possible the double parts?

    My first shot is dividing the whole file in single parts like lines or chapters, deriving hash-values und doing the modified gcs with the shorter file. But this is far away from reliable.

    Any Ideas? Thank you!


    GMarco

  2. #2
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,016

    Re: GCS-Variation

    I think this solution is related to the Basic Local Alignment Search Tool (BLAST) algorithm in computational biology (not really sure; haven't read the details in awhile). (Wikipedia about the BLAST algorithm: http://en.wikipedia.org/wiki/BLAST)

    Anyway, the idea is to slide a window of size k (say k=3) across your input and save the location of each k-tuple to a hash table. Every time a hash collision occurs, try to expand the region of similarity between each of the k-tuples. Keep track of the longest one you have found.

    Example:

    Consider string (w/o spaces; only added for readability):
    AAA BBBC DDD BBBC EEE (so BBBC repeats itself)

    Window 1: AAA -> add to hash table
    Window 2: AAB -> add to hash table
    Window 3: ABB -> add to hash table
    Window 4: BBB -> add to hash table
    .....
    Window 11: BBB -> hash collision! Try to expand left and right from window 4 and 11...
    ...
    Until the end

    Some implementation details left to you, obviously. (For example, how to choose k optimally and what to do if the greatest repeat length is < k... probably start with some 'good' k and iteratively decrement it).

    Hope that helps! Good luck!
    Last edited by BioPhysEngr; June 16th, 2011 at 01:34 AM. Reason: add encouragement
    Best Regards,

    BioPhysEngr
    http://blog.biophysengr.net
    --
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

  3. #3
    Join Date
    May 2011
    Posts
    22

    Re: GCS-Variation

    Thank you very much. The BLAST ist about 4.000.000 times faster than everything, wich I had before :-)

    The critical part was to find a fast hash function adapted to the sequence problem. I used a polynome over a prime ring with the value of the letters as the coefficients.

    Best regards,
    GMarco

  4. #4
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,016

    Re: GCS-Variation

    Great! Glad it worked out well.
    Best Regards,

    BioPhysEngr
    http://blog.biophysengr.net
    --
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  





Click Here to Expand Forum to Full Width

Featured