CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Results 1 to 10 of 10
  1. #1
    Join Date
    Sep 2004
    Posts
    57

    Algorithm that searches for multiple patterns in a string?

    I'm wondering if there is an algorithm out there that searches for multiple patterns in a string. I need it to also find out if each pattern is a certain distance from the pattern before it. Is there anything like this? I've been using the Boyer-Moore algorithm for a very efficient single exact pattern search, but I really need something that will search for multiple patterns x distance from the previous pattern.

    Thanks,
    Jon

  2. #2
    Join Date
    Jun 2006
    Location
    Bangalore, India
    Posts
    332

    Re: Algorithm that searches for multiple patterns in a string?

    Rabin-karp is the algorithm you are searching for. Its not efficient for single pattern matching(when compared to Boyer-Moore or Knuth-morris) but obviously better than the two in multiple pattern searching problem. Also hash function used in Rabin-Karp can exploit the distance of the successive patterns.

    Several variants of Rabin-Karp are also used in Bioinformatics to sequence strings of DNA for which multiple pattern matching with deletion, insertion, evolution... of letters of the string are also taken into account.

  3. #3

    Re: Algorithm that searches for multiple patterns in a string?

    OK
    Suffix tree I think it`s a good idea.
    And General suffix tree for multiple patterns...

    Dimitris

    * I use the book Algorithms on Strings, Trees and sequences (Gusfield)
    I think that it has what you need on string [a good book for Bioinformatics]
    Last edited by JimK; June 12th, 2006 at 11:25 AM.
    You talking to me?

  4. #4
    Join Date
    Sep 2004
    Posts
    57

    Re: Algorithm that searches for multiple patterns in a string?

    I tested out the Rabin-Karp algorithm and I'm not sure if it will be fast enough. Maybe it's the way I implemented it, I'm not sure. I found the algorithm on wikipedia( http://en.wikipedia.org/wiki/Rabin-K...arch_algorithm ) and this is what I whipped up:

    Code:
            public int rabinKarp()
            {
                
                string strBig = utf.GetString(bigBytes); // "string" to search
                string[] strSubs = new string[smallBytes.Length]; // search patterns
    
                Hashtable hsubs = new Hashtable();
                for (int i = 0; i < strSubs.Length; i++)
                {
                    strSubs[i] = utf.GetString(smallBytes[i]);
                    hsubs.Add(i, strSubs[i].GetHashCode());
                }
    
                int patternLength = strSubs[0].Length; // all patterns are of the same length
                int hs;
                for(int i=0; i<strBig.Length-patternLength; i++)
                {
                    hs = strBig.Substring(i, patternLength).GetHashCode();
                    if (hsubs.ContainsValue(hs))
                    {
                        if (strBig.Substring(i, patternLength).GetHashCode() == hs)
                            return i;
                    }
                }
    
                return -1;
            }
    While this only finds the first pattern, it took around 300 milliseconds to complete. This seems a little slow for me.

    This is what I'm trying to do: I have two images. One is a large image(1280x1024 screenshot of my desktop) and one is a smaller image(50x50 that I copied and pasted from the larger image. Each row of pixels(stored in a byte array) of the smaller image is used as a pattern. So, I need each pattern to be x width from the previous pattern, and if all are found with equidistance, then the subimage is found.

    I'm using a different method right now with the Boyer-Moore algorith(just search for the very first row of pixels of the smaller image, then loop through manually checking pixels to make a match). This works, for the most part, but it is not always reliable or consistent. Search times can range from 1 second to under 2 milliseconds, depending on the size of the subimage and other factors. I'm wanting a reliable and consistent method that would only take under 10-20 milliseconds. Is this reasonable?

    Any input on this is greatly appreciated.

  5. #5
    Join Date
    Aug 2002
    Location
    Madrid
    Posts
    4,588

    Re: Algorithm that searches for multiple patterns in a string?

    I think your current solution is a much better approach. The general multiple substring searches will always be slower.

    In order to minimise the chances of degenerate cases, you could search for the line with the highest variance in the sub image instead of the first line. Then it would be hard to find a case where this degenerates and produces long search times.

    Edit: Ok, there are degenerate cases for that one too. One possible way around them is to first look for the horizontal row with the highest variance and when that is found, check whether the matching vertical line with highest variance also matches.

    Since you can compute the variance of each line once at the start and don't need to do it later anymore, I don't think it's much of an overhead.
    Last edited by Yves M; June 12th, 2006 at 03:14 PM.
    Get this small utility to do basic syntax highlighting in vBulletin forums (like Codeguru) easily.
    Supports C++ and VB out of the box, but can be configured for other languages.

  6. #6
    Join Date
    Jun 2002
    Location
    Moscow, Russia.
    Posts
    2,176

    Re: Algorithm that searches for multiple patterns in a string?

    You got hash code calculation wrong. You should caclulate cyclic hash, so that you need to get only single symbol each iteration. It's all there in wikipedia article.
    "Programs must be written for people to read, and only incidentally for machines to execute."

  7. #7
    Join Date
    Sep 2004
    Posts
    57

    Re: Algorithm that searches for multiple patterns in a string?

    Quote Originally Posted by Yves M
    I think your current solution is a much better approach. The general multiple substring searches will always be slower.

    In order to minimise the chances of degenerate cases, you could search for the line with the highest variance in the sub image instead of the first line. Then it would be hard to find a case where this degenerates and produces long search times.

    Edit: Ok, there are degenerate cases for that one too. One possible way around them is to first look for the horizontal row with the highest variance and when that is found, check whether the matching vertical line with highest variance also matches.

    Since you can compute the variance of each line once at the start and don't need to do it later anymore, I don't think it's much of an overhead.
    I like this idea a lot. So far I've got the search time for a a 1280x1024 screenshot down to an average of 5 milliseconds. This is of course if the first line has a decent amount of variance. Then, when the sub image's first line is say, all blue like my whole desktop wallpaper, it goes up to around a second. So, your idea of searching for the line of most variance is great.

    I have a question about it though. By the highest variance, do you mean the line with the highest standard deviation? If so, what would be the most efficient way to calculate the variance of each line my sub image? I found this function ( http://www.eggheadcafe.com/articles/...ion_dotnet.asp ) but doing this multiple times(say 50, if that is the height of my sub image) would significantly add time to the search when I'm talkin in milliseconds. Though, it could be worth it with the cases I described above that take around a second to calculate.

    Thanks for yalls input so far!

  8. #8
    Join Date
    Aug 2002
    Location
    Madrid
    Posts
    4,588

    Re: Algorithm that searches for multiple patterns in a string?

    Well, just don't take the square root and use something like:
    Code:
    // Calculate mean
    long mean = 0;
    for (int i = 0; i < width; ++i) {
      mean += subi[i];
    }
    mean /= width;
    // Calculate variance
    long variance = 0, val;
    for (int i = 0; i < width; ++i) {
      val = subi[i] - mean;
      variance += val * val;
    }
    There may be better ways to get a variance-like number which you could calculate faster (e.g. try abs instead of squaring etc).
    Get this small utility to do basic syntax highlighting in vBulletin forums (like Codeguru) easily.
    Supports C++ and VB out of the box, but can be configured for other languages.

  9. #9
    Join Date
    Sep 2004
    Posts
    57

    Re: Algorithm that searches for multiple patterns in a string?

    Again, thanks for your help! A search that would take a second now takes 5 milliseconds, while only adding 3-5 milliseconds of overhead finding the highest variant line (Even with a height of 250!).

  10. #10
    Join Date
    Aug 2002
    Location
    Madrid
    Posts
    4,588

    Re: Algorithm that searches for multiple patterns in a string?

    Cool, glad I could help
    Get this small utility to do basic syntax highlighting in vBulletin forums (like Codeguru) easily.
    Supports C++ and VB out of the box, but can be configured for other languages.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  





Click Here to Expand Forum to Full Width

Featured