-
May 18th, 2010, 01:00 PM
#1
[RESOLVED] How to Parse 1000 Files, Skipping "Bad" Lines?
Hi everyone,
I'm a moderately experienced C++ programmer working on code which must do the following:
(a) Import data from a lot of little files
(b) Load that data into various objects
(c) Do stuff with that data
The code I've written does (a), (b), and (c) pretty well, but I've noticed a problem with (a), which I want to ask you guys about.
Suppose I have 1000 source files. My program successfully processes Files #1 through #500. But when it reaches File #501, my program chokes and seg faults, and I automatically lose *ALL* the data I've collected. This is a big problem, because there is a LOT of data to process. It may take me three or four hours just to reach File #500.
When the program reads a file, each individual line is loaded into a string called Line, which is then parsed for individual values. If I'm reading gdb right (output below), the parsing is causing the trouble. As for the line in the file which is causing the trouble, I don't see any format problems with the line itself. When I run the program multiple times, it is the same exact line which causes the seg fault every time.
What would be awesome would be a way to tell the program, "if you see a line which confuses you, skip that line, don't just automatically crash!" Skipping the entire file would be okay too.
Below is the code I'm using. Below that is the gdb analysis of why my program is choking. Any help or advice would be appreciated!
===================================================================
===================================================================
My Code:
Code:
vector<string> ListOfFiles;
string Line;
string Value;            // holds one comma-separated field (declaration was missing above)
vector<string> ValRow;
// Load all the file names into ListOfFiles
for (size_t i = 0; i < ListOfFiles.size(); i++)   // size_t matches the type of size()
{
    Line.clear();
    ifstream In_Flows(ListOfFiles[i].c_str());
    while (getline(In_Flows, Line))
    {
        // Split the current line on commas into ValRow
        istringstream linestream(Line);
        ValRow.clear();
        while (getline(linestream, Value, ','))
        { ValRow.push_back(Value); }
    }
    // Load contents of ValRow into objects
}
===================================================================
===================================================================
My GDB Output:
Program received signal SIGSEGV, Segmentation fault.
0xff056b20 in realfree () from /lib/libc.so.1
(gdb) bt
#0 0xff056b20 in realfree () from /lib/libc.so.1
#1 0xff0573d4 in cleanfree () from /lib/libc.so.1
#2 0xff05652c in _malloc_unlocked () from /lib/libc.so.1
#3 0xff05641c in malloc () from /lib/libc.so.1
#4 0xff337734 in operator new () from /usr/local/lib/libstdc++.so.6
#5 0xff318fe4 in std::string::_Rep::_S_create ()
from /usr/local/lib/libstdc++.so.6
#6 0xff3196e0 in std::string::_M_mutate () from /usr/local/lib/libstdc++.so.6
#7 0xff30bd28 in std::getline<char, std::char_traits<char>, std::allocator<char> > () from /usr/local/lib/libstdc++.so.6
#8 0x000199ac in ReadTheFile (PtrFlowInfoFile=0xffbffc48,
PtrPrefixInfoFile=0xffbffc38, PtrRouterObjLibrary=0x41808, ATAFlag=true)
at ReadTheFile.h:119
#9 0x0001a2f0 in main (argc=2, argv=0xffbffccc) at Main.cpp:60
(gdb) up
#1 0xff0573d4 in cleanfree () from /lib/libc.so.1
(gdb) up
#2 0xff05652c in _malloc_unlocked () from /lib/libc.so.1
(gdb) up
#3 0xff05641c in malloc () from /lib/libc.so.1
(gdb) up
#4 0xff337734 in operator new () from /usr/local/lib/libstdc++.so.6
(gdb) up
#5 0xff318fe4 in std::string::_Rep::_S_create ()
from /usr/local/lib/libstdc++.so.6
(gdb) up
#6 0xff3196e0 in std::string::_M_mutate () from /usr/local/lib/libstdc++.so.6
(gdb) up
#7 0xff30bd28 in std::getline<char, std::char_traits<char>, std::allocator<char> > () from /usr/local/lib/libstdc++.so.6
(gdb) up
#8 0x000199ac in ReadTheFile (PtrFlowInfoFile=0xffbffc48,
PtrPrefixInfoFile=0xffbffc38, PtrRouterObjLibrary=0x41808, Flag=true)
at ReadTheFile.h:119
119 while(getline(linestream, Value, ','))
(gdb) print Line
$1 = {static npos = 4294967295,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0xd756c "DataPoint0,DataPoint1,DataPoint2,DataPoint3,DataPoint4,DataPoint5,DataPoint6,DataPoint7,DataPoint8,DataPoint9"}}
(gdb)
-
May 18th, 2010, 01:07 PM
#2
Re: How to Parse 1000 Files, Skipping "Bad" Lines?
Does that file still crash if you run it by itself?
Have you looked at it in a binary editor to see if there's anything weird in there?
Have you tried stepping through the code when it crashes to see exactly what's going on?
-
May 18th, 2010, 01:18 PM
#3
Re: How to Parse 1000 Files, Skipping "Bad" Lines?
I would recommend running it through valgrind. A crash inside malloc/free like this one usually indicates some type of heap corruption. Yes, valgrind will be much slower, so you might not get all the way to the crash in a reasonable amount of time... but if there's a systemic problem, it might show up earlier in a nonfatal way which valgrind will still catch. Worth an overnight run, at least.
Memory leaks are also a possibility. Monitor your program through top as it runs, and make sure the used memory isn't steadily climbing (more than expected) as it goes.
It's also possible that you're simply trying to store too much data in memory at once, and that you're legitimately running out.
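One way to tell the cases apart: as far as I know, when the standard operator new genuinely cannot get memory it throws std::bad_alloc, and that can be caught so one file's failure doesn't take down the whole run. A SIGSEGV inside malloc, as in the backtrace above, is not a C++ exception and can't be caught this way. Here's a rough sketch; processAll and the processOne callback are names I made up for illustration.

```cpp
#include <cstddef>
#include <exception>
#include <iostream>
#include <new>
#include <string>
#include <vector>

// Hypothetical driver: calls processOne(name) for each file and keeps going
// when one file throws. Returns the names of the files that failed.
template <typename ProcessOne>
std::vector<std::string> processAll(const std::vector<std::string>& files,
                                    ProcessOne processOne)
{
    std::vector<std::string> failed;
    for (std::size_t i = 0; i < files.size(); ++i)
    {
        try
        {
            processOne(files[i]);
        }
        catch (const std::bad_alloc&)
        {
            // Memory exhaustion reported by operator new.
            std::cerr << "out of memory while reading " << files[i] << '\n';
            failed.push_back(files[i]);
        }
        catch (const std::exception& e)
        {
            // Any other standard exception thrown while handling this file.
            std::cerr << "error in " << files[i] << ": " << e.what() << '\n';
            failed.push_back(files[i]);
        }
    }
    return failed;
}
```

If the run dies with a seg fault even with this in place, that points away from simple exhaustion and back toward corruption.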
-
May 21st, 2010, 02:52 PM
#4
Re: How to Parse 1000 Files, Skipping "Bad" Lines?
Hi everyone, thanks for your advice. At this point, I suspect I am running up against a memory constraint of some kind. I'll retool the code to make sure I'm not littering the stack; hopefully that will fix this problem. Many thanks!
-
May 21st, 2010, 03:33 PM
#5
Re: [RESOLVED] How to Parse 1000 Files, Skipping "Bad" Lines?
All I see is that your ValRow variable can potentially become really huge. A quick run monitoring its size (or simply looking at system resources via your OS) should tell you whether this is a limited-memory problem.
I see nothing wrong or problematic in your code, though, except for ValRow's potential size.
I'd put Line as a local variable that is built/destroyed every cycle personally, as I hate objects over-extending their scope, but this is more of a matter of style than anything.
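To show what I mean by tighter scope, here is a minimal sketch, not your actual code: readRows and Row are names I invented, and I'm assuming each file can be handled as a plain stream of comma-separated lines.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Row;

// Hypothetical per-file reader: every buffer is a local, so it is destroyed
// when the loop body ends or when the function returns.
std::vector<Row> readRows(std::istream& in)
{
    std::vector<Row> rows;
    std::string line;                      // lives only for this file
    while (std::getline(in, line))
    {
        std::istringstream fields(line);   // rebuilt for each line
        Row row;                           // rebuilt for each line
        std::string value;
        while (std::getline(fields, value, ','))
            row.push_back(value);
        rows.push_back(row);
    }
    return rows;
}
```

Call it once per file with an std::ifstream, load the returned rows into your objects, and let the result go out of scope before opening the next file.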
Is your question related to IO?
Read this C++ FAQ article at parashift by Marshall Cline. In particular points 1-6.
It will explain how to correctly deal with IO, how to validate input, and why you shouldn't count on "while(!in.eof())". And it always makes for excellent reading.
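As a rough illustration of validating input and skipping bad lines, assuming a "good" line is one that splits into exactly the expected number of non-empty comma-separated fields (that rule, and the names ParseStats, splitLine and parseStream, are mine, not from the FAQ):

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct ParseStats
{
    std::size_t accepted;
    std::size_t skipped;
};

// Hypothetical validator: true if the line splits into exactly
// expectedFields comma-separated values, none of them empty.
bool splitLine(const std::string& line, std::size_t expectedFields,
               std::vector<std::string>& out)
{
    out.clear();
    std::istringstream fields(line);
    std::string value;
    while (std::getline(fields, value, ','))
    {
        if (value.empty())
            return false;
        out.push_back(value);
    }
    return out.size() == expectedFields;
}

ParseStats parseStream(std::istream& in, std::size_t expectedFields)
{
    ParseStats stats = {0, 0};
    std::string line;
    std::vector<std::string> row;
    // Test the read itself. A while(!in.eof()) loop can end up processing
    // a failed read as if it were data.
    while (std::getline(in, line))
    {
        if (splitLine(line, expectedFields, row))
            ++stats.accepted;   // hand row to the object-loading code here
        else
            ++stats.skipped;    // malformed line: skip it and keep going
    }
    return stats;
}
```

Note that this only protects against lines that are malformed as text. It cannot rescue a program whose heap is already corrupted.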