CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Page 1 of 2
Results 1 to 15 of 17
  1. #1
    Join Date
    Sep 2004
    Posts
    1,361

    [RESOLVED] Help parsing gigantic text files (64 MB+ )

    I have these giant text files I want to parse. I am looking for a FAST method and was considering using RegEx. However, let me give you some sample data here.

    If I load the text file into notepad and do some simple string searches, it can take 20 or 30 seconds or more to find a hit.

    If I load this into a test C# program I have, load the text file into a string, and use some simple RegEx expressions flagged as multi-line, it can take a long time too, although it is faster than Notepad.

    At this point I would say, well it is a giant text file it takes some time except for the next part:

    There is this application someone told me to check out, TextPad. So I download and install this and load up my text file, which it does instantly (Notepad grinds for a while). Then I do the same string searches and it is instant. I can say, find all occurrences of X and bookmark them. BAM! It's done.

    So there *is* a way, somehow. Maybe someone here knows how they do it or *THE WAY*, or maybe not.

    What I would like advice on is an efficient method of parsing text files with an eye towards speed. For example, maybe the RegEx engine is just a poor choice for large files. Or perhaps the "multi-line" property is inefficient, and maybe I should store my file as a series of lines (as strings) rather than a single gigantic string and iterate over each line.

    Anyway, this is kind of a green field; I am open to suggestions.

  2. #2
    Join Date
    Jan 2002
    Location
    Scaro, UK
    Posts
    5,940

    Re: Help parsing gigantic text files (64 MB+ )

    What I would like advice on is an efficient method of parsing text files with an eye towards speed.
    You are assuming that TextPad is written in .NET - probably not. It's probably written in C++.

    String manipulation is one of the things that .NET (in my opinion) is very slow at.

    The problem lies with the fact that strings are immutable, so if you want a substring you have to do a block memory copy into a new string instance.

    Not only that, the bounds checking also gets in the way.

    C++ on the other hand is very good at this sort of thing, and extremely quick too if you avoid STL and do things using char pointers (in my experience).

    So how do you solve your problem? Have you tried string.IndexOf to find simple strings? This might be fast enough - chances are you don't need all the advanced features of a regex.

    If you do need to do a find which is more advanced you might just be stuck with the speed unless you switch to using C++ for the find.

    Darwen.
    www.pinvoker.com - PInvoker - the .NET PInvoke Interface Exporter for C++ Dlls.

  3. #3
    Join Date
    Sep 2004
    Posts
    1,361

    Re: Help parsing gigantic text files (64 MB+ )

    I am not changing the string, it is immutable. I am reading, not writing.

    Yes, Textpad is C++. However I am sure notepad is also C++. Notepad is slow, TextPad is fast.

  4. #4
    Join Date
    Jul 2006
    Posts
    297

    Re: Help parsing gigantic text files (64 MB+ )

    Quote Originally Posted by DeepT View Post
    I am not changing the string, it is immutable. I am reading, not writing.

    Yes, Textpad is C++. However I am sure notepad is also C++. Notepad is slow, TextPad is fast.
    I'm sure that Firefox, Chrome, and Internet Explorer are all written in C++ as well. But they all run at different speeds. The language can have the capability to be fast, but without proper code it's no more efficient than C#.

  5. #5
    Join Date
    Feb 2005
    Location
    Denmark
    Posts
    742

    Re: Help parsing gigantic text files (64 MB+ )

    Use stream readers and stream writers and read segments into memory, handle what needs to be handled in that segment, and then write them back to disk.
    The specifics of how much you need to read into memory depend fully on what you need to do with the file.
    I've used this method myself on XML files in excess of 800 MB and it runs quite fast (much faster than your reported Notepad numbers), so 64 MB is hardly gigantic.

    And if performance is an issue, perhaps regular expressions aren't the way to go, depending on what you actually need to do.
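A minimal sketch of that streaming approach (the file path, sample lines, and search term here are placeholders I made up so the sketch runs as-is):

```csharp
using System;
using System.IO;

class StreamScan
{
    // Count lines containing "term" without loading the whole file at once.
    public static int CountHits(string path, string term)
    {
        int hits = 0;
        // StreamReader pulls the file through a buffer line by line,
        // so the whole 64 MB never has to sit in one giant string.
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                if (line.IndexOf(term) >= 0)
                    hits++;
        }
        return hits;
    }

    static void Main()
    {
        // Tiny sample file so the sketch is self-contained; in practice
        // "path" would be your 64 MB log.
        string path = Path.Combine(Path.GetTempPath(), "sample.log");
        File.WriteAllLines(path, new[] {
            "Build 1234 Startup", "heartbeat", "Build 1240 Startup"
        });
        Console.WriteLine(CountHits(path, "Startup")); // 2
    }
}
```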

  6. #6
    Join Date
    Jun 2008
    Posts
    2,477

    Re: Help parsing gigantic text files (64 MB+ )

    Quote Originally Posted by DeepT View Post
    I am not changing the string, it is immutable. I am reading, not writing.

    Yes, Textpad is C++. However I am sure notepad is also C++. Notepad is slow, TextPad is fast.
    You may very well be creating many temporary strings while searching though. You are not changing the data you are looking at, but that does not mean that you are not creating strings in your code. Also, you may not realize that the library functions that you are using are also creating new string objects.
    Last edited by BigEd781; July 23rd, 2009 at 01:35 PM.

  7. #7
    Join Date
    May 2007
    Posts
    1,546

    Re: Help parsing gigantic text files (64 MB+ )

    In the top 10 reasons why this is fast in one program and slow in another, the language used to program the app doesn't even appear [0]. That's how little it matters.

    There are an unbelievable number of ways to implement this kind of thing. Regex is likely to be the fastest and easiest way. The other thing is that this task is highly parallelisable. You can easily split the input text into X chunks where X is the number of logical cores on your system and process each chunk independently and merge the results at the end. This would give a near linear speedup on large datasets.

    Show us what you implemented, then we can comment on how to make it better.

    [0] Just as another datapoint, have you ever noticed how god-awful slow Firefox is for searching large pages? It's hideous! Guess what language Firefox is written in...
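    The chunk-and-merge idea can be sketched with PLINQ, where line-level partitioning stands in for explicit chunking (the sample lines are made up; substitute your loaded file):

    ```csharp
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    class ParallelScan
    {
        static void Main()
        {
            // Small inline sample standing in for the 64 MB file.
            string[] lines = { "Build 10 Startup", "noise line", "Build 11 Startup" };
            Regex regex = new Regex(@"Build (\d+) Startup");

            // AsParallel() partitions the array across the logical cores;
            // Regex matching is thread-safe, so each chunk can be scanned
            // independently and the results merged into one list at the end.
            var matches = lines.AsParallel()
                               .Where(line => regex.IsMatch(line))
                               .ToList();

            Console.WriteLine(matches.Count); // 2
        }
    }
    ```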
    www.monotorrent.com For all your .NET bittorrent needs

    NOTE: My code snippets are just snippets. They demonstrate an idea which can be adapted by you to solve your problem. They are not 100% complete and fully functional solutions equipped with error handling.

  8. #8
    Join Date
    Sep 2004
    Posts
    1,361

    Re: Help parsing gigantic text files (64 MB+ )

    I have found out this so far:

    If "string" = gigantic text file, then a "multi-line" RegEx.Matches( string ) is MUCH SLOWER

    than

    List<string> TextFile; where each line of the text file is its own string, and then

    For J ... RegEx.Matches( TextFile[J] );

    One set of RegExes does take a while, but not nearly as long as doing it in one big chunk. However it is still slow, and I may end up with 50 or 60 or MORE regexes by the time I am done.

    I do not know what TextPad does, but the file might as well be a 100k text file, because it is so fast.
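    Spelled out, the per-line approach described above looks roughly like this (the sample lines and helper name are illustrative):

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    class PerLineMatch
    {
        // Run the regex one line at a time and collect the captured build numbers.
        public static List<string> ExtractBuildNumbers(List<string> textFile)
        {
            Regex regex = new Regex(@"Build (\d+) Startup");
            List<string> numbers = new List<string>();

            // Each input to Matches() is one short line, which avoids the
            // pathological multi-line scan over a single 64 MB string.
            for (int j = 0; j < textFile.Count; j++)
                foreach (Match m in regex.Matches(textFile[j]))
                    numbers.Add(m.Groups[1].Value);
            return numbers;
        }

        static void Main()
        {
            // Inline sample standing in for the loaded file.
            var lines = new List<string> {
                "Build 1234 Startup", "some other line", "Build 1240 Startup"
            };
            Console.WriteLine(string.Join(", ", ExtractBuildNumbers(lines))); // 1234, 1240
        }
    }
    ```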

  9. #9
    Join Date
    May 2007
    Posts
    1,546

    Re: Help parsing gigantic text files (64 MB+ )

    I'm searching a 250 MB text file for all occurrences of "setup" and that takes 1.1 seconds. Making it case insensitive adds an additional 400 ms. This test would probably execute faster on an SSD, as it's more than likely disk-IO limited.

    How slow is your search and can we see the code?

    EDIT: So i checked, if I preload the entire file into memory then regex it, the search takes 125 milliseconds (for 250 megabytes). Making it case insensitive adds 400ms (as before), resulting in ~500ms for the entire search in this scenario.
    Last edited by Mutant_Fruit; July 23rd, 2009 at 03:33 PM.
    www.monotorrent.com For all your .NET bittorrent needs

    NOTE: My code snippets are just snippets. They demonstrate an idea which can be adapted by you to solve your problem. They are not 100% complete and fully functional solutions equipped with error handling.

  10. #10
    Join Date
    Sep 2004
    Posts
    1,361

    Re: Help parsing gigantic text files (64 MB+ )

    I am looking at much more complex strings than "setup", and I am not doing a find but a RegEx. I can't really upload the text file I am using either. I dunno - if you can get a complex web-based application log file that is 64 or more megs, maybe that might be similar enough. Then find 3 or 4 events that happen and have some program parse the log and print the line (along with the time-stamp) of each event. It is only taking 3 or 4 seconds, doing it the way I am doing it now.

    The original single RegEx took the same time, but currently I am now up to 5 RegExes and the total time has not changed much. Doing it the old way, I would assume that each RegEx would take 5 seconds (for example), which would just stack the more I have.

    I'll really see how well my current method scales, and once I get 20 or 30, maybe the time will be more noticeable.

    Here is the original RegEx string I used on my text file that took a good while:
    "^.*Build (\d+) Startup"

    The RegEx was set for multi-line, and if a file had 10 occurrences in it, then you have an idea of the kind of distribution in the file. That is much more complex than looking for a simple word.

  11. #11
    Join Date
    Sep 2004
    Posts
    1,361

    Re: Help parsing gigantic text files (64 MB+ )

    Actually I added another RegEx and it is now taking a very long time. I think that the complexity of the expression greatly affects the time it takes. So maybe RegEx are NOT the way to go. Or I need to figure out what is "expensive" and what is not with regards to RegEx. This latest one took 4 minutes to run.

  12. #12
    Join Date
    May 2007
    Posts
    1,546

    Re: Help parsing gigantic text files (64 MB+ )

    So then maybe it's your regexes that need to be looked at

    "^.*Build (\d+) Startup"

    So the above regex attempts to match any character zero or more times, followed by the literal "Build ", which is followed by one or more digits, which is followed by " Startup".

    Surely that's the exact same as:

    "Build (\d+) Startup".

    Wouldn't the reduced regex perform faster? While it is possible to create regexes which take an extremely long time to run, it's unlikely here. Generally speaking, if you're not using backreferences it's pretty hard to make a *really* slow regex. If you want, I wouldn't mind looking at your regexes to see if I could spot any obvious issues.

    Finally, I assume you are reading the source file in just once, right? Or are you reading it once for each regex you want to use?
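    As a sketch of the comparison above (RegexOptions.Compiled is an extra assumption on my part; it often pays off when one pattern is run over many lines):

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    class PatternCompare
    {
        static void Main()
        {
            string line = "2009-07-23 Build 1234 Startup complete";

            // The leading ^.* forces extra scanning and backtracking work...
            Regex original = new Regex(@"^.*Build (\d+) Startup");
            // ...while the trimmed pattern accepts exactly the same lines.
            Regex reduced = new Regex(@"Build (\d+) Startup", RegexOptions.Compiled);

            Console.WriteLine(original.IsMatch(line)); // True
            Console.WriteLine(reduced.IsMatch(line));  // True
        }
    }
    ```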
    www.monotorrent.com For all your .NET bittorrent needs

    NOTE: My code snippets are just snippets. They demonstrate an idea which can be adapted by you to solve your problem. They are not 100% complete and fully functional solutions equipped with error handling.

  13. #13
    Join Date
    Sep 2004
    Posts
    1,361

    Re: Help parsing gigantic text files (64 MB+ )

    It's one file; I read it in once. The reason I do not use the one you used is because I want the entire line, in that case. Not in all cases, but in that case. But yes, the RegExes are killing me. I just made one, and I think it will finish sometime after the Sun runs out of hydrogen.

    So, is there a source someone can point me to that will highlight bad/slow RegExes vs fast ones? I'll have to try some other strategies - maybe using something that simply says "this line matches", but then not using the Match itself, and instead using the original line for other processing after the fact that is too expensive to do on every line with a RegEx.

    I think anything with .* has to be a really bad idea in terms of speed.

  14. #14
    Join Date
    May 2007
    Posts
    1,546

    Re: Help parsing gigantic text files (64 MB+ )

    Quote Originally Posted by DeepT View Post
    Its one file, I read it in once. The reason I do not use the one you used is because I want the entire line, in that case.
    Oh so easy

    Code:
    string current;
    string data = File.ReadAllText (inputFile); // read the entire file into memory
    List<string> matches = new List<string> ();
    Regex regex = new Regex (@"Build (\d+) Startup"); // verbatim string so \d isn't an invalid escape
    using (StringReader reader = new StringReader (data))
        while ((current = reader.ReadLine ()) != null)
            if (regex.IsMatch (current))
                matches.Add (current);
    That'll probably finish about 100x faster than your original regex and give exactly the same final output.

    I think anything with .* has to be a really bad idea in terms of speed.
    Yup, i tend to agree.

    EDIT: You could extend this approach to read the file line by line and pass each individual line to all your regexes before reading the next line. This approach has the benefit that you don't have to read the entire file into memory.
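    That streaming variant might look like this sketch (the file path and patterns are placeholders so it runs as-is):

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    class MultiRegexStream
    {
        // One pass over the file: each line visits every regex, and no more
        // than one line is ever held in memory at a time.
        public static List<string>[] Scan(string path, Regex[] regexes)
        {
            var results = new List<string>[regexes.Length];
            for (int i = 0; i < results.Length; i++)
                results[i] = new List<string>();

            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    for (int i = 0; i < regexes.Length; i++)
                        if (regexes[i].IsMatch(line))
                            results[i].Add(line);
            }
            return results;
        }

        static void Main()
        {
            // Tiny sample file standing in for the real log; the patterns
            // are placeholders for your real set of regexes.
            string path = Path.Combine(Path.GetTempPath(), "multi.log");
            File.WriteAllLines(path, new[] {
                "Build 7 Startup", "warning: low disk", "all quiet"
            });

            var results = Scan(path, new[] {
                new Regex(@"Build (\d+) Startup"), new Regex("warning")
            });
            Console.WriteLine(results[0].Count + " " + results[1].Count); // 1 1
        }
    }
    ```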
    Last edited by Mutant_Fruit; July 23rd, 2009 at 04:55 PM.
    www.monotorrent.com For all your .NET bittorrent needs

    NOTE: My code snippets are just snippets. They demonstrate an idea which can be adapted by you to solve your problem. They are not 100% complete and fully functional solutions equipped with error handling.

  15. #15
    Join Date
    Sep 2004
    Posts
    1,361

    Re: Help parsing gigantic text files (64 MB+ )

    I do read the entire file in line by line and store it in a list, line by line. I then run that list against another list of RegExes. Currently it is still slow, although faster than before. Mostly when I have a RegEx where I want it to match Alpha but not Beta.
    IE:

    "Blah Blah Alpha foo loo" - match
    "Blah Alpha Foo Loo Doo Beta Zoo" - no match.

    Those are still slow.
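    One way to express "contains Alpha but not Beta" without a costly look-around pattern is two cheap tests per line (a sketch using the example lines above, not necessarily how TextPad does it):

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    class AlphaNotBeta
    {
        static readonly Regex alpha = new Regex("Alpha");
        static readonly Regex beta = new Regex("Beta");

        // Two simple IsMatch calls usually beat a single pattern with a
        // negative look-ahead that has to rescan the whole line.
        public static bool Matches(string line)
        {
            return alpha.IsMatch(line) && !beta.IsMatch(line);
        }

        static void Main()
        {
            Console.WriteLine(Matches("Blah Blah Alpha foo loo"));         // True
            Console.WriteLine(Matches("Blah Alpha Foo Loo Doo Beta Zoo")); // False
        }
    }
    ```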
