best way to deal with big array or vector?

**Lindley** · December 11th, 2009, 11:42 AM

All streams are closed when their object is destroyed. Typically this happens when they go out of scope.

**laserlight** · December 11th, 2009, 11:51 AM

Originally Posted by dukevn

Afterward, do I still have to close the file, or don't I?

It will be closed when the stream object is destroyed when it goes out of scope. Of course, if the end of the scope is not near, it is good to explicitly close it anyway.
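To make the scope rule concrete, here is a minimal sketch. The function name `write_then_read` and the file path passed to it are hypothetical, chosen only for illustration. The first stream is closed implicitly at the end of its block, while the second is closed explicitly because the enclosing scope continues for a while:

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes two lines to `path`, then reads the file back.
std::string write_then_read(const std::string& path)
{
    {
        std::ofstream fout(path);
        fout << "first\n";
    }   // fout is destroyed here: its buffer is flushed and the file is closed

    std::ofstream more(path, std::ios::app);
    more << "second\n";
    more.close();   // the end of the scope is not near, so close explicitly

    std::ifstream fin(path);
    std::string line, all;
    while (std::getline(fin, line))
    {
        all += line + ';';
    }
    return all;     // fin is closed automatically when it goes out of scope
}
```

Either way the data reaches the file; the explicit close() only controls when that happens.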

**dukevn** · December 11th, 2009, 12:13 PM

Originally Posted by Lindley

All streams are closed when their object is destroyed. Typically this happens when they go out of scope.

Originally Posted by laserlight

It will be closed when the stream object is destroyed when it goes out of scope. Of course, if the end of the scope is not near, it is good to explicitly close it anyway.

So it is fine if I use flush, but it is good (and concise) if I use close, right? Is flush more efficient than close?

**laserlight** · December 11th, 2009, 12:19 PM

Originally Posted by dukevn

So it is fine if I use flush, but it is good (and concise) if I use close, right? Is flush more efficient than close?

It is up to you. I used flush() in my example because that is what std::endl does in addition to writing a newline, and my intention was to demonstrate that there is no need to flush the stream on each iteration. You just need to flush once at the end, and you may not even need to do that explicitly.
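As a rough sketch of that point (the function names below are made up for this illustration and are not the code from the earlier post), writing '\n' inside the loop and flushing once afterwards might look like this:

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Writes 0..n-1, one number per line, flushing only once at the end.
void write_numbers(std::ostream& out, int n)
{
    for (int i = 0; i < n; ++i)
    {
        out << i << '\n';   // newline only; std::endl would also flush every time
    }
    out.flush();            // a single flush after the loop
}

// Hypothetical convenience wrapper so the output is easy to inspect.
std::string numbers_as_text(int n)
{
    std::ostringstream oss;
    write_numbers(oss, n);
    return oss.str();
}
```

With a file stream, even the final flush() is optional if the stream is about to be closed or destroyed, since closing flushes the buffer.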

**dukevn** · December 11th, 2009, 12:59 PM

Originally Posted by laserlight

It is up to you. I used flush() in my example because that is what std::endl does in addition to writing a newline, and my intention was to demonstrate that there is no need to flush the stream on each iteration. You just need to flush once at the end, and you may not even need to do that explicitly.

I did not know of flush() before your example, and I did not know that I have to (or should?) flush a stream. But I do remember that I did not get any error or warning when I forgot to close a file.

**Philip Nicoletti** · December 11th, 2009, 01:01 PM

Originally Posted by dukevn

In the real input file, there are other things after the third column. Not sure about your suggestion, but I will try. Thanks.

In that case, you would need to do the following (will work if there are other
things after the third column ... or if there are only 3 columns):

Code:

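  // Read the first three columns; the loop ends at EOF or on a malformed line.
  while ( fin >> tempLine >> x1 >> x2 ) 
  {
      getline(fin,tempLine);   // discard whatever follows the third column
      for (int i=0;i<=(x2-x1);i++) 
      {
        ++mapTest[x1+i];       // count every position in [x1, x2]
      }
  }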

**laserlight** · December 11th, 2009, 01:07 PM

Originally Posted by Philip Nicoletti

In that case, you would need to do the following

Good point. It may be more explanatory to use the ignore() member function though.

**dukevn** · December 11th, 2009, 01:22 PM

Originally Posted by Philip Nicoletti

In that case, you would need to do the following (will work if there are other
things after the third column ... or if there are only 3 columns):

Code:

  while ( fin >> tempLine >> x1 >> x2 ) 
  {
      getline(fin,tempLine);
      for (int i=0;i<=(x2-x1);i++) 
      {
        ++mapTest[x1+i];
      }
  }

You got it just right, Philip. I was wondering why

Code:

  while ( fin >> tempLine >> x1 >> x2 ) {
    string temp;
    int x1, x2;
    if ( fin >> temp >> x1 >> x2 ) {
      for ( int i = x1; i <= x2; ++i ) {
        ++mapTest[i];
      }
    }
  }

neglects the first input line, and you have shed some light on that for me. Testing them now, and I will report back the results.
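For anyone reading along, the reason can be sketched with a string stream standing in for the file. The function names are hypothetical. When the loop condition reads a record and the body reads again, only the second record of each pair gets processed:

```cpp
#include <sstream>
#include <string>

// Reads in the loop condition AND again in the body, as in the snippet above.
int records_seen_double_read(const std::string& text)
{
    std::istringstream in(text);
    std::string name;
    int x1, x2;
    int seen = 0;
    while (in >> name >> x1 >> x2)        // consumes one record...
    {
        std::string name2;
        int y1, y2;
        if (in >> name2 >> y1 >> y2)      // ...then consumes the next one
        {
            ++seen;                       // only this second record is used
        }
    }
    return seen;
}

// Reads once per iteration, so every record is used.
int records_seen_single_read(const std::string& text)
{
    std::istringstream in(text);
    std::string name;
    int x1, x2;
    int seen = 0;
    while (in >> name >> x1 >> x2)
    {
        ++seen;
    }
    return seen;
}
```

With four input records, the double-read version uses two of them and the single-read version uses all four.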

**dukevn** · December 11th, 2009, 01:23 PM

Originally Posted by laserlight

It may be more explanatory to use the ignore() member function though.

Would you mind giving me more explanation? How do I use ignore()?

**dukevn** · December 11th, 2009, 01:46 PM

Originally Posted by dukevn

Testing them now, and I will report back the results.

OK, here are the results with a 1.06 GB input file on a cluster node with 8 cores:

- Original code: 36m2.169s
- Improved v.1 code: 12m40.918s
- Improved v.2 (without XMax): 11m39.172s
- Final code v.3 (without string stream): 11m35.974s

So there is not much difference between the last three versions (but they are three times as fast as the original one, a great improvement). One thing I am aiming for now is to take advantage of the multiple cores (right now the code runs on only one core), but that does not seem to be easy.

Thanks for all of your help.

**laserlight** · December 11th, 2009, 01:49 PM

Originally Posted by dukevn

Would you mind giving me more explanation? How do I use ignore()?

Suppose according to the input format there will be a tab character between the first field and the second field. You could dispense with the temporary string variable that was used to ignore input:

Code:

int x1, x2;
while (fin.ignore(1000, '\t') && (fin >> x1 >> x2))
{
    fin.ignore(1000, '\n');
    for (int i = x1; i <= x2; ++i) 
    {
        ++mapTest[i];
    }
}

where 1000 is arbitrarily chosen. You could have used std::numeric_limits<std::streamsize>::max() instead.
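A self-contained variant using std::numeric_limits might look like the sketch below. It assumes, as above, that a tab separates the first field from the second. The names `count_coverage` and `count_coverage_from_text` are invented for this example, and a std::map stands in for mapTest:

```cpp
#include <istream>
#include <limits>
#include <map>
#include <sstream>
#include <string>

// Counts how many input ranges [x1, x2] cover each integer position.
std::map<int, int> count_coverage(std::istream& in)
{
    // "No limit": ignore() keeps discarding until it finds the delimiter.
    const std::streamsize no_limit = std::numeric_limits<std::streamsize>::max();

    std::map<int, int> coverage;
    int x1, x2;
    while (in.ignore(no_limit, '\t') && (in >> x1 >> x2))  // skip first field
    {
        in.ignore(no_limit, '\n');                         // skip rest of line
        for (int i = x1; i <= x2; ++i)
        {
            ++coverage[i];
        }
    }
    return coverage;
}

// Hypothetical wrapper so the function can be tried without a file.
std::map<int, int> count_coverage_from_text(const std::string& text)
{
    std::istringstream in(text);
    return count_coverage(in);
}
```

Passing the maximum std::streamsize removes the risk that a field or line longer than 1000 characters is only partly skipped.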

**Lindley** · December 11th, 2009, 01:49 PM

flush() is something you usually don't need to call yourself. It usually gets handled automatically. But it's available because occasionally you do need to call it explicitly.

Multiple cores are not going to help much for file parsing. Everything bottlenecks through the disk controller anyway. Typically, the biggest multi-core gain comes when you're doing heavy mathematical computations in main memory.

**dukevn** · December 11th, 2009, 03:56 PM

Originally Posted by laserlight

Suppose according to the input format there will be a tab character between the first field and the second field. You could dispense with the temporary string variable that was used to ignore input:

Code:

int x1, x2;
while (fin.ignore(1000, '\t') && (fin >> x1 >> x2))
{
    fin.ignore(1000, '\n');
    for (int i = x1; i <= x2; ++i) 
    {
        ++mapTest[i];
    }
}

where 1000 is arbitrarily chosen. You could have used std::numeric_limits<std::streamsize>::max() instead.

Got it. Thanks laserlight.

**dukevn** · December 11th, 2009, 03:57 PM

Originally Posted by Lindley

Multiple cores are not going to help much for file parsing. Everything bottlenecks through the disk controller anyway. Typically, the biggest multi-core gain comes when you're doing heavy mathematical computations in main memory.

Are you saying that splitting the input file into 8 chunks and processing those 8 chunks in parallel will not help at all?

**VladimirF** · December 11th, 2009, 05:17 PM

Originally Posted by dukevn

OK, here are the results with a 1.06 GB input file on a cluster node with 8 cores:

- Original code: 36m2.169s
- Improved v.1 code: 12m40.918s
- Improved v.2 (without XMax): 11m39.172s
- Final code v.3 (without string stream): 11m35.974s

So there is not much difference between the last three versions (but they are three times as fast as the original one, a great improvement).

My guess is that code like that should execute at the speed of file I/O.
I know I can copy a 1GB file in about 30 seconds, so 11 minutes sounds like WAY too much.
Could you comment out everything in your code except for I/O and see how long that takes?
Do you mind posting your code and a sample data file?
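One possible way to run that I/O-only measurement is sketched below. The struct and function names are hypothetical, and the byte count is only approximate because getline() discards the newline:

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>

// Result of reading a file line by line without doing any other work.
struct ReadStats
{
    std::size_t lines;
    std::size_t bytes;    // approximate: assumes one newline per line
    double seconds;
};

// Reads `path` with getline() only and reports how long that took.
ReadStats time_raw_read(const std::string& path)
{
    ReadStats stats = {0, 0, 0.0};
    std::ifstream fin(path);
    std::string line;

    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    while (std::getline(fin, line))
    {
        ++stats.lines;
        stats.bytes += line.size() + 1;   // +1 for the discarded newline
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    stats.seconds = elapsed.count();
    return stats;
}
```

If this alone takes close to 11 minutes on the real file, the disk or file system is the limit. If it finishes in well under a minute, the time is going into the parsing and counting code.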
