I've been using a big regular expression and Java's regex classes (Pattern, Matcher, etc.) to find IPv6 addresses in large chunks of text. It works, but I'm thinking a custom algorithm designed to find only IPv6 addresses could run much faster.
Does anyone know of a fast IPv6-finding algorithm written in Java and in the public domain? It has to find all valid IPv6 addresses in a String, in both collapsed and expanded form.
Thanks for the link. I actually took a look at that article a while ago. But even the most optimized regular expression is still going to be significantly slower than a custom FSM built to match one specific pattern. The custom algorithm could take shortcuts, use custom data structures, avoid backtracking, and so on, in ways that a more general engine like Java's full-featured regex engine can't.
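For what it's worth, here is a rough sketch of what such a hand-written scanner might look like. It is not an existing library and not tuned in any way. The names Ipv6Scanner, isIpv6 and findAll are made up for illustration. It assumes a candidate is a maximal run of hex digits and colons, so an address with an extra colon stuck to it (like "fe80::1:") is missed, and it ignores IPv4-embedded forms such as ::ffff:1.2.3.4 and zone IDs. Each character is looked at a small constant number of times, with no backtracking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-pass IPv6 finder; a sketch, not a vetted implementation.
final class Ipv6Scanner {

    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // Validates text[start, end), which must contain only hex digits and colons.
    // Accepts 8 groups of 1-4 hex digits, or at most 7 groups with a single "::".
    static boolean isIpv6(CharSequence s, int start, int end) {
        int groups = 0;            // completed hex groups
        int digits = 0;            // hex digits seen in the current group
        boolean compressed = false;
        int i = start;
        while (i < end) {
            char c = s.charAt(i);
            if (c == ':') {
                if (i + 1 < end && s.charAt(i + 1) == ':') {
                    if (compressed) return false;          // only one "::" allowed
                    compressed = true;
                    if (digits > 0) { groups++; digits = 0; }
                    i += 2;
                    if (i < end && s.charAt(i) == ':') return false; // ":::" is invalid
                    continue;
                }
                // A single colon must sit between two non-empty groups.
                if (digits == 0 || i + 1 == end) return false;
                groups++;
                digits = 0;
            } else if (++digits > 4) {
                return false;                              // group too long
            }
            i++;
        }
        if (digits > 0) groups++;
        return compressed ? groups <= 7 : groups == 8;
    }

    // Returns every maximal hex/colon run in the text that is a valid IPv6 address.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            char c = text.charAt(i);
            if (c != ':' && !isHex(c)) { i++; continue; }
            int start = i;
            while (i < n && (text.charAt(i) == ':' || isHex(text.charAt(i)))) i++;
            if (isIpv6(text, start, i)) found.add(text.substring(start, i));
        }
        return found;
    }
}
```

Calling Ipv6Scanner.findAll("src=2001:db8::1 dst=fe80::1 at 12:34:56") picks out the two addresses and skips the timestamp, since three groups without a "::" is not a valid address.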
This part of our application is critical--we're searching very large amounts of data for IPv6s. I just wanted to see if something was out there before I take the plunge and write it myself...
There is nothing that I am aware of for your specific needs. If speed is really an issue, you might want to think about distributing the workload rather than trying to squeeze the most out of the algorithm. The algorithm can only be improved so much before you hit a ceiling, and past that point the sheer amount of data is just going to be too much.
Do you have the ability to distribute the workload among several servers? If you have a couple of machines you can use (even if it's just one other machine with multiple cores), you should look into something like Hadoop (a MapReduce framework for Java). You can supply custom classes that parse and handle the chunks of data handed to them, and the chunks are processed in a distributed fashion. That way the work is at least being done in parallel, since you will only be able to speed up your regex search so much.
I think you will get the most performance gain by distributing the process rather than trying to improve the algorithm.
Good suggestion. I've been wanting an excuse to learn about Hadoop. I'm keeping that idea in my back pocket in case it's needed, but going down that road increases the complexity and effort. Even if we distribute the work, it still makes a lot of sense to optimize the find algorithm.
Our current data suggested a temporary solution: given that IPv6 addresses are still pretty rare, most of the records we're searching don't have them. So it was a lot faster to write a custom algorithm that just eliminates records that can't contain any IPv6 addresses, rather than trying to find every IPv6 address in each record. Now we only run the big ugly regex on records we suspect have IPv6 addresses. The problem is that over time, more and more IPv6 addresses will start showing up in records, so this approach will get worse and worse. But it's much faster at the moment.
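To make the idea concrete, here is a minimal sketch of that kind of pre-filter. It is not our actual code. Ipv6Prefilter, mightContainIpv6 and findInRecords are hypothetical names, and fullRegex stands in for the existing big regex, which isn't shown here. The check relies on one assumption: any textual IPv6 address contains either "::" or two colons separated by 1 to 4 hex digits, so a record with no such pair can be skipped. False positives (timestamps like 12:34:56, for example) are fine because the regex still has the final say.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-filter: cheaply rejects records that cannot hold an IPv6 address.
final class Ipv6Prefilter {

    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // True if the record has two colons with at most 4 hex digits (and nothing else)
    // between them. Necessary for an IPv6 address, but not sufficient.
    static boolean mightContainIpv6(String record) {
        int lastColon = -1; // most recent colon followed only by hex digits so far
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (c == ':') {
                if (lastColon >= 0 && i - lastColon - 1 <= 4) return true;
                lastColon = i;
            } else if (!isHex(c)) {
                lastColon = -1; // a non-hex character breaks the pair
            }
        }
        return false;
    }

    // Runs the (assumed, caller-supplied) full regex only on records that pass the filter.
    static List<String> findInRecords(List<String> records, Pattern fullRegex) {
        List<String> found = new ArrayList<>();
        for (String record : records) {
            if (!mightContainIpv6(record)) continue; // cheap rejection, no regex work
            Matcher m = fullRegex.matcher(record);
            while (m.find()) found.add(m.group());
        }
        return found;
    }
}
```

The filter is a single pass with no allocation, so on data where most records have no colons at all it costs very little. As more records start passing it, the total cost drifts back toward running the regex on everything, which is the weakness described above.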