[RESOLVED] How to Parse 1000 Files, Skipping "Bad" Lines?


  1. #1
    Join Date
    Apr 2010
    Posts
    19

    [RESOLVED] How to Parse 1000 Files, Skipping "Bad" Lines?

    Hi everyone,

    I'm a moderately experienced C++ programmer working on code which must do the following:
    (a) Import data from a lot of little files
    (b) Load that data into various objects
    (c) Do stuff with that data

    The code I've written does (a), (b), and (c) pretty well, but I've noticed a problem with (a), which I want to ask you guys about.

    Suppose I have 1000 source files. My program successfully processes Files #1 through #500. But when it reaches File #501, my program chokes and seg faults, and I automatically lose *ALL* the data I've collected. This is a big problem, because there is a LOT of data to process. It may take me three or four hours just to reach File #500.

    When the program reads a file, each individual line is loaded into a string called Line, which is then parsed for individual values. If I'm reading gdb right (output below), the parsing is causing the trouble. As for the line in the file which is causing the trouble, I don't see any format problems with the line itself. When I run the program multiple times, it is the same exact line which causes the seg fault every time.

    What would be awesome would be a way to tell the program, "if you see a line which confuses you, skip that line, don't just automatically crash!" Skipping the entire file would be okay too.

    Below is the code I'm using. Below that is the gdb analysis of why my program is choking. Any help or advice would be appreciated!


    ===================================================================
    ===================================================================
    My Code:


    Code:
    vector<string> ListOfFiles;
    string Line;
    string Value;
    vector<string> ValRow;
    
    // Load all the file names into ListOfFiles
    
    for(size_t i=0; i<ListOfFiles.size(); i++)
      {
        Line.clear();
        ifstream In_Flows((ListOfFiles[i]).c_str());
        while (getline(In_Flows, Line))
          {
            istringstream linestream(Line);
            ValRow.clear();
            while(getline(linestream, Value, ','))
              { ValRow.push_back(Value); }
            // Load contents of ValRow into objects
          }
      }
    ===================================================================
    ===================================================================
    My GDB Output:


    Program received signal SIGSEGV, Segmentation fault.
    0xff056b20 in realfree () from /lib/libc.so.1
    (gdb) bt
    #0 0xff056b20 in realfree () from /lib/libc.so.1
    #1 0xff0573d4 in cleanfree () from /lib/libc.so.1
    #2 0xff05652c in _malloc_unlocked () from /lib/libc.so.1
    #3 0xff05641c in malloc () from /lib/libc.so.1
    #4 0xff337734 in operator new () from /usr/local/lib/libstdc++.so.6
    #5 0xff318fe4 in std::string::_Rep::_S_create ()
    from /usr/local/lib/libstdc++.so.6
    #6 0xff3196e0 in std::string::_M_mutate () from /usr/local/lib/libstdc++.so.6
    #7 0xff30bd28 in std::getline<char, std::char_traits<char>, std::allocator<char> > () from /usr/local/lib/libstdc++.so.6
    #8 0x000199ac in ReadTheFile (PtrFlowInfoFile=0xffbffc48,
    PtrPrefixInfoFile=0xffbffc38, PtrRouterObjLibrary=0x41808, ATAFlag=true)
    at ReadTheFile.h:119
    #9 0x0001a2f0 in main (argc=2, argv=0xffbffccc) at Main.cpp:60
    (gdb) up
    #1 0xff0573d4 in cleanfree () from /lib/libc.so.1
    (gdb) up
    #2 0xff05652c in _malloc_unlocked () from /lib/libc.so.1
    (gdb) up
    #3 0xff05641c in malloc () from /lib/libc.so.1
    (gdb) up
    #4 0xff337734 in operator new () from /usr/local/lib/libstdc++.so.6
    (gdb) up
    #5 0xff318fe4 in std::string::_Rep::_S_create ()
    from /usr/local/lib/libstdc++.so.6
    (gdb) up
    #6 0xff3196e0 in std::string::_M_mutate () from /usr/local/lib/libstdc++.so.6
    (gdb) up
    #7 0xff30bd28 in std::getline<char, std::char_traits<char>, std::allocator<char> > () from /usr/local/lib/libstdc++.so.6
    (gdb) up
    #8 0x000199ac in ReadTheFile (PtrFlowInfoFile=0xffbffc48,
    PtrPrefixInfoFile=0xffbffc38, PtrRouterObjLibrary=0x41808, Flag=true)
    at ReadTheFile.h:119
    119 while(getline(linestream, Value, ','))
    (gdb) print Line
    $1 = {static npos = 4294967295,
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
    _M_p = 0xd756c "DataPoint0,DataPoint1,DataPoint2,DataPoint3,DataPoint4,DataPoint5,DataPoint6,DataPoint7,DataPoint8,DataPoint9"}}
    (gdb)

  2. #2
    GCDEF
    Join Date
    Nov 2003
    Location
    Florida
    Posts
    12,204

    Re: How to Parse 1000 Files, Skipping "Bad" Lines?

    Does that file still crash if you run it by itself?

    Have you looked at it in a binary editor to see if there's anything weird in there?

    Have you tried stepping through the code when it crashes to see exactly what's going on?

  3. #3
    Lindley
    Join Date
    Oct 2007
    Location
    Fairfax, VA
    Posts
    10,891

    Re: How to Parse 1000 Files, Skipping "Bad" Lines?

    I would recommend running it through valgrind. An error in there usually indicates some type of corruption. Yes, valgrind will be much slower, so you might not get all the way there in a reasonable amount of time... but if there's a systemic problem, it might show up earlier in a nonfatal way that valgrind will still catch. Worth an overnight run, at least.

    Memory leaks are also a possibility. Monitor your program through top as it runs, and make sure the used memory isn't steadily climbing (more than expected) as it goes.

    It's also possible that you're simply trying to store too much data in memory at once, and that you're legitimately running out.

  4. #4
    Join Date
    Apr 2010
    Posts
    19

    Re: How to Parse 1000 Files, Skipping "Bad" Lines?

    Hi everyone, thanks for your advice. At this point, I suspect I am running up against a memory constraint of some kind. I'll retool the code to make sure I'm not littering the stack; hopefully that will fix this problem. Many thanks!

  5. #5
    Join Date
    Jun 2009
    Location
    France
    Posts
    2,339

    Re: [RESOLVED] How to Parse 1000 Files, Skipping "Bad" Lines?

    All I see is that your ValRow variable can potentially become really huge. I'm sure a quick run monitoring its size (or simply looking at system resources via your OS) should quickly tell you whether it is a limited-memory problem.

    I see nothing in your code that is wrong or problematic, though, except for ValRow's potential size.

    Personally, I'd make Line a local variable that is built/destroyed every cycle, as I hate objects over-extending their scope, but this is more a matter of style than anything.

    Is your question related to IO? Read the C++ FAQ LITE article at parashift by Marshall Cline, in particular points 1-6. It explains how to correctly deal with IO, how to validate input, and why you shouldn't count on "while(!in.eof())". It always makes for excellent reading.
