-
December 21st, 2009, 08:53 AM
#1
String too long for Regex?
Hello!
I have a web page loaded into a string that is 1262594 characters long. I want to run a regex search on it to find all the links to a given site, like this:
Code:
Pattern = "<a(.[^<>]*)href([ \\s]*)=([ \\s'\"]*?)(http://|https://)([^<>'\"\\?]*?)(example.com)([^'\"> ]*)(['\" ]*)(.*?)>(.*?)</a>";
Regex myRegex = new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Console.WriteLine("Done Regex");
MatchCollection mc = myRegex.Matches(html);
Console.WriteLine("Done Matches");
if (mc.Count == 0) {
Console.WriteLine("Done mc.Count");
}
Console.WriteLine("Done all");
This works fine for shorter strings, but with the long string the program hangs: the main window just freezes, and no exception is thrown. I waited about 30 minutes and then killed the process.
The output is:
Done Regex
Done Matches
... so it seems that the if (mc.Count == 0) check hangs somehow.
When I set a breakpoint at the if (mc.Count == 0) line and look at mc.Count in the Autos window, I get:
Count Function evaluation disabled because a previous function evaluation timed out. You must continue execution to reenable function evaluation. int
Stepping one line further crashes the application as well.
Any ideas about that?
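A note on the output above, offered as a likely explanation rather than a diagnosis: Regex.Matches returns a lazily populated MatchCollection, so the actual scan only runs when Count is read. That would explain why "Done Matches" is printed before the freeze. The sketch below assumes .NET 4.5 or later (newer than the versions mentioned in this thread), where a match timeout can turn a runaway search into an exception. The class name, method name, sample pattern and sample input are made up for illustration.

Code:
```csharp
using System;
using System.Text.RegularExpressions;

public static class TimeoutDemo
{
    // Returns true when counting the matches finished within the timeout,
    // false when the regex engine gave up. (Hypothetical helper.)
    public static bool TryCountMatches(string pattern, string input, TimeSpan timeout, out int count)
    {
        // Assumption: the (pattern, options, matchTimeout) constructor needs .NET 4.5+.
        Regex regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline, timeout);
        MatchCollection mc = regex.Matches(input); // returns at once, nothing scanned yet
        try
        {
            count = mc.Count; // reading Count forces the scan of the whole input
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            count = -1;
            return false;
        }
    }

    public static void Main()
    {
        int count;
        bool ok = TryCountMatches("a+b", "aab xx ab", TimeSpan.FromSeconds(2), out count);
        Console.WriteLine(ok ? "Matches: " + count : "Timed out");
    }
}
```

With a timeout in place, the original pattern would probably report "Timed out" on the long string instead of freezing the window, which at least makes the problem visible.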
Last edited by martho; December 21st, 2009 at 08:58 AM.
-
December 21st, 2009, 09:08 AM
#2
Re: String too long for Regex?
You're using dots (".") in your expression without escaping them. Do you really mean to match any character there?
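To illustrate the point about the dots, here is a small sketch. In the pattern, (example.com) also matches strings such as exampleXcom, because an unescaped "." stands for any character. Regex.Escape is a real framework method; the class name, method names and sample strings are assumptions for the demo.

Code:
```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // The dot is left unescaped, so it matches any single character.
    public static bool Unescaped(string s)
    {
        return Regex.IsMatch(s, "example.com");
    }

    // Regex.Escape turns "example.com" into "example\.com", a literal dot.
    public static bool Escaped(string s)
    {
        return Regex.IsMatch(s, Regex.Escape("example.com"));
    }

    public static void Main()
    {
        Console.WriteLine(Unescaped("exampleXcom")); // True
        Console.WriteLine(Escaped("exampleXcom"));   // False
        Console.WriteLine(Escaped("example.com"));   // True
    }
}
```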
win7 x86, VS 2008 & 2010, C++/CLI, C#, .NET 3.5 & 4.0, VB.NET, VBA... WPF is comming
remeber to give feedback you think my response deserves recognition? perhaps you may want to click the Rate this post link/button and add to my reputation
private lessons are not an option so please don't ask for help in private, I won't replay
if you use Opera and you'd like to have the tab-button functionality for the texteditor take a look at my Opera Tab-UserScirpt; and if you know how to stop firefox from jumping to the next control when you hit tab let me know
-
December 21st, 2009, 09:25 AM
#3
Re: String too long for Regex?
Thanks for your reply.
Changing the pattern as follows (the only change is the second-to-last group, from (.*?) to (.[^>]*?)):
Code:
Before:
Pattern = "<a(.[^<>]*)href([ \\s]*)=([ \\s'\"]*?)(http://|https://)([^<>'\"\\?]*?)(example.com)([^'\"> ]*)(['\" ]*)(.*?)>(.*?)</a>";
After:
Pattern = "<a(.[^<>]*)href([ \\s]*)=([ \\s'\"]*?)(http://|https://)([^<>'\"\\?]*?)(example.com)([^'\"> ]*)(['\" ]*)(.[^>]*?)>(.*?)</a>";
did the trick and it runs without problems again.
If you see any other optimizations, I would be glad to hear about them. I don't know whether there is a way to optimize the last (.*?) before the </a> (there, every character matches until a </a> follows).
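On why the change helps, one plausible reading: with RegexOptions.Singleline, the original (.*?)>(.*?)</a> lets the first wildcard run past the end of the tag, so on a failed attempt the engine can retry every later ">" and rescan for "</a>" each time, which grows roughly quadratically on a string this long. Keeping every group inside the tag avoids that. The remaining lazy (.*?) before </a> stops at the first </a> it meets, so it should cost only about the length of the link text. The sketch below is one possible tightened pattern, not a drop-in replacement: the class name, method name and sample HTML are made up, and the pattern assumes simple, well-formed anchors.

Code:
```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFinder
{
    // Every part before '>' is restricted to characters that cannot leave the tag.
    // The dot in example\.com is escaped and the redundant [ \s] classes are plain \s.
    private static readonly Regex LinkRegex = new Regex(
        @"<a\b[^<>]*?\bhref\s*=\s*['""]?" +
        @"(https?://[^<>'""?\s]*example\.com[^'""<>\s]*)" +
        @"['""]?[^<>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the URL (group 1) of every matching link.
    public static List<string> FindUrls(string html)
    {
        List<string> urls = new List<string>();
        foreach (Match m in LinkRegex.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }

    public static void Main()
    {
        string html =
            "<p><a class=\"x\" href=\"http://www.example.com/page?id=1\">First</a> " +
            "<a href='https://other.org/'>Other</a> " +
            "<a href=\"https://example.com/\">Second <b>bold</b></a></p>";
        foreach (string url in FindUrls(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Enumerating the MatchCollection with foreach, instead of reading Count first, also lets the matches be processed one at a time.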
-
December 21st, 2009, 09:33 AM
#4
Re: String too long for Regex?
This "([ \\s]*)" is actually the same as "([\\s]*)" because \s already matches whitespace characters, including the space.
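A quick way to check this, as a sketch: run both forms against the same inputs and compare what they consume. The class name, method name and sample inputs are assumptions for the demo.

Code:
```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Compares the leading whitespace consumed by "[ \s]*" and by "\s*".
    public static bool SameResult(string input)
    {
        string withSpace = Regex.Match(input, "^[ \\s]*").Value;
        string withoutSpace = Regex.Match(input, "^\\s*").Value;
        return withSpace == withoutSpace;
    }

    public static void Main()
    {
        string[] samples = { " \t\r\n x", "   ", "x", "" };
        foreach (string sample in samples)
        {
            Console.WriteLine(SameResult(sample)); // True for each sample
        }
    }
}
```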
Last edited by memeloo; December 21st, 2009 at 09:37 AM.
-
December 21st, 2009, 09:41 AM
#5
Re: String too long for Regex?
You are right, thank you.