String too long for Regex?
Hello!
I have a web page loaded into a string that is 1,262,594 characters long. I want to run a regex search on it to find all the links to a page like:
Code:
Pattern = "<a(.[^<>]*)href([ \\s]*)=([ \\s'\"]*?)(http://|https://)([^<>'\"\\?]*?)(example.com)([^'\"> ]*)(['\" ]*)(.*?)>(.*?)</a>";
Regex myRegex = new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Console.WriteLine("Done Regex");
// Matches() is evaluated lazily: it returns before the input has been scanned.
MatchCollection mc = myRegex.Matches(html);
Console.WriteLine("Done Matches");
// Reading Count forces the whole input to be scanned.
if (mc.Count == 0) {
    Console.WriteLine("Done mc.Count");
}
Console.WriteLine("Done all");
This works fine for shorter strings, but with the long string the program hangs: the main window just freezes, and no exception is thrown. I waited about 30 minutes and then killed the process.
The output is:
Done Regex
Done Matches
... so it seems that the if (mc.Count == 0) check hangs somehow.
When I set a breakpoint at the if (mc.Count == 0) line and look at mc.Count in the Auto-Watch window, I get:
Count Function evaluation disabled because a previous function evaluation timed out. You must continue execution to reenable function evaluation. int
Stepping one line further crashes the application as well.
Any ideas about that?
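The output above fits how .NET evaluates matches: Regex.Matches returns a lazily populated MatchCollection, so the real scan only starts when Count is read. As a rough sketch, and assuming .NET Framework 4.5 or later (where the Regex constructor accepts a match timeout), a runaway scan can be turned into an exception instead of a frozen window. The class and method names below are hypothetical, not from the original post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkScan
{
    // Hypothetical helper: counts matches, but gives up instead of hanging.
    // Returns null if a match attempt runs longer than `timeout`.
    public static int? CountMatches(string html, string pattern, TimeSpan timeout)
    {
        var regex = new Regex(
            pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline,
            timeout);
        try
        {
            // Matches() returns immediately; the scan runs when Count is read.
            return regex.Matches(html).Count;
        }
        catch (RegexMatchTimeoutException)
        {
            // The timeout applies to each match attempt, not to the whole scan.
            return null;
        }
    }
}
```

A call such as LinkScan.CountMatches(html, Pattern, TimeSpan.FromSeconds(5)) would then return null for a pattern that backtracks excessively, which at least makes the problem visible.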
Re: String too long for Regex?
You're using "." dots in your expression without escaping them. Do you really mean to match any character there?
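To illustrate the point about unescaped dots, here is a small sketch using the example.com part of the pattern. The class and method names are made up for the illustration.

```csharp
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Unescaped: "." stands for any character, so "exampleXcom" also matches.
    public static bool MatchesUnescaped(string s)
    {
        return Regex.IsMatch(s, "example.com");
    }

    // Escaped: "\\." in a C# string is the regex "\.", a literal dot only.
    public static bool MatchesEscaped(string s)
    {
        return Regex.IsMatch(s, "example\\.com");
    }
}
```

With the input "http://exampleXcom/", MatchesUnescaped returns true and MatchesEscaped returns false; both return true for "http://example.com/".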
Re: String too long for Regex?
Thanks for your reply.
Changing the pattern as follows (the group just before the closing >, which goes from (.*?) to (.[^>]*?)):
Code:
Before:
Pattern = "<a(.[^<>]*)href([ \\s]*)=([ \\s'\"]*?)(http://|https://)([^<>'\"\\?]*?)(example.com)([^'\"> ]*)(['\" ]*)(.*?)>(.*?)</a>";
After:
Pattern = "<a(.[^<>]*)href([ \\s]*)=([ \\s'\"]*?)(http://|https://)([^<>'\"\\?]*?)(example.com)([^'\"> ]*)(['\" ]*)(.[^>]*?)>(.*?)</a>";
did the trick and it runs without problems again.
If you see any other optimizations, I would be glad to hear them. I don't know whether there is a way to optimize the last (.*?) before the </a>, because there every character matches until a </a> follows.
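A plausible explanation for the difference, not confirmed in this thread: with Singleline, (.*?)> may stop at any > in the rest of the document, and for each of those the final (.*?)</a> scans ahead again, so one candidate link without a usable </a> can cost roughly quadratic time on 1.2 million characters. With (.[^>]*?) the group cannot run past the next >, so far fewer positions are retried. The sketch below only checks that both patterns from the post find the same link on a small sample; the helper name and the sample HTML are invented for the illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // Both patterns are copied from the post above; only group 9 differs.
    public const string Before = "<a(.[^<>]*)href([ \\s]*)=([ \\s'\"]*?)(http://|https://)([^<>'\"\\?]*?)(example.com)([^'\"> ]*)(['\" ]*)(.*?)>(.*?)</a>";
    public const string After = "<a(.[^<>]*)href([ \\s]*)=([ \\s'\"]*?)(http://|https://)([^<>'\"\\?]*?)(example.com)([^'\"> ]*)(['\" ]*)(.[^>]*?)>(.*?)</a>";

    // Hypothetical helper: rebuilds each matched URL from groups 4 to 7
    // (scheme, host prefix, "example.com", rest of the URL).
    public static List<string> FindUrls(string html, string pattern)
    {
        var regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        var urls = new List<string>();
        foreach (Match m in regex.Matches(html))
        {
            urls.Add(m.Groups[4].Value + m.Groups[5].Value +
                     m.Groups[6].Value + m.Groups[7].Value);
        }
        return urls;
    }
}
```

For a sample such as <p><a href="http://other.org/">Other</a> <a class="x" href="https://example.com/a?b=1">Ex</a></p>, both patterns should yield the single URL https://example.com/a?b=1; the difference only shows up in how much work is done when a match attempt fails late.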
Re: String too long for Regex?
This "([ \\s]*)" is actually the same as "([\\s]*)", because \s already matches whitespace characters, including the space.
Re: String too long for Regex?
You are right, thank you.
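Pulling the suggestions from this thread together (escaped dots, plain \s, groups that cannot run past the end of the tag), a tidied pattern might look like the sketch below. It is not equivalent to the original: it assumes only the URL and the link text need to be captured, and all names are hypothetical. The final (.*?) is kept, since it has to scan forward to the next </a> in any case.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TidiedPattern
{
    // Assumption: only the URL (group 1) and the link text (group 2) are needed.
    public const string Pattern =
        "<a\\b[^<>]*?\\bhref\\s*=\\s*['\"]?" +                // tag start up to the href value
        "(https?://[^<>'\"?\\s]*example\\.com[^'\"<>\\s]*)" + // URL, with literal dots
        "['\"]?[^>]*>" +                                      // rest of the opening tag
        "(.*?)</a>";                                          // link text up to the next </a>

    // Returns the URL of every link that points at example.com.
    public static List<string> FindUrls(string html)
    {
        var regex = new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        var urls = new List<string>();
        foreach (Match m in regex.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

As with the original pattern, "example.com" has to appear before any ? in the URL, so query strings that merely mention the domain are not picked up.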