best way to deal with big array or vector?
CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Page 1 of 4 1234 LastLast
Results 1 to 15 of 54

Thread: best way to deal with big array or vector?

  1. #1
    Join Date
    Dec 2009
    Posts
    89

    best way to deal with big array or vector?

    Hi all,

    I am trying c++ with some maths now, and I started with simple 2D vector (or array, or matrix, what ever you call it). What I want to do is: in the file below
    Code:
    a1	12	17
    a1	20	25
    a1	21	26
    a1	33	38
    a1	35	40
    a1	36	41
    a1	51	56
    a1	65	70
    ...
    each line is one "line", or "vector", with the second and third columns are x coordinates. The program read line by line, reads those two x-coordinates (say x1 and x2), and gives y at any point between those two xs additional one unit (so y(20) = y(21) = ... = y(25) = 1 for the first line). y of all points start at zero. You can see that y(21) = 2 since 21 appears in two intervals (two first lines), y(37) = 3 since 37 appears in three lines.

    After reading the file, it should write to a new text file with the non-zero points only, point by line, like below:
    Code:
    12	1
    13	1
    14	1
    15	1
    16	1
    17	1
    20	1
    21	2
    22	2
    23	2
    24	2
    25	2
    26	1
    33	1
    34	1
    35	2
    36	3
    37	3
    38	3
    39	2
    40	2
    41	1
    51	1
    52	1
    ...
    Quite simple if the input file is small. The algorithm I came up with is:

    - define array y from 0 to max of x, initial values are all zero.
    - read the file, if x is from x1 to x2, then y(x)++;
    - after done, write to file: from 0 -> x max, if y(x) is not zero, then write it down.

    Problem is if x max is big, then the defined array would be big, and that is a waste of memory as well since a lot of y array is zero. Also, if x max is very big (millions to billions) then the algorithm would not work.

    Anybody please help!

  2. #2
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Fairfax, VA
    Posts
    10,888

    Re: best way to deal with big array or vector?

    When your input is very sparse or over a potentially large range, consider using a std::map instead of an array or std::vector. In this case, defining y as a std::map<int,int> rather than an array should provide the desired behavior.

  3. #3
    Join Date
    Dec 2009
    Posts
    89

    Re: best way to deal with big array or vector?

    Quote Originally Posted by Lindley View Post
    When your input is very sparse or over a potentially large range, consider using a std::map instead of an array or std::vector. In this case, defining y as a std::map<int,int> rather than an array should provide the desired behavior.
    std::map is new to me. Even the C++ book I have (C++ Primer Plus) does not have it. Would you mind giving me some examples and how to use it?

  4. #4
    Join Date
    Dec 2009
    Posts
    89

    Re: best way to deal with big array or vector?

    Quote Originally Posted by dukevn View Post
    std::map is new to me. Even the C++ book I have (C++ Primer Plus) does not have it. Would you mind giving me some examples and how to use it?
    Nevermind. I got it. Below is the code I came up with:
    Code:
    #include <map>
    #include <vector>
    #include <stdio.h>
    #include <stdlib.h>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    
    using namespace std;
    
    int main() {
      map<string, int> mapTest;
      ifstream fin("in.txt");
      string tempLine;
      int XMax;
      while ( getline(fin, tempLine, '\n') ) {
        vector <string> tempString;
        stringstream ss(tempLine);
        string tempTab;
        while ( getline(ss, tempTab, '\t') ) {
          tempString.push_back(tempTab);
        }
        int x1 = atoi(tempString[1].c_str());
        int x2 = atoi(tempString[2].c_str());
        for (int i=0;i<=(x2-x1);i++) {
          stringstream tempX;
          tempX << x1 + i;
          mapTest[tempX.str()]++;
        }
        XMax = x2;
      }
      fin.close();
      ofstream fout("out.txt");
      for ( int k=0;k<=XMax;k++ ) {
        stringstream tempK;
        tempK << k;
        if ( mapTest[tempK.str()]!=0 ) {
          fout << tempK.str() << "\t" << mapTest[tempK.str()] << endl;
        }
      }
      fout.close();
      return 0;
    }
    I am gonna test it with a very big input file to see how it performances, but any recommendation / suggestion to improve the code to work with big file is welcome. Also, I need to see some performance's reports (CPU, usage, memory etc...), so any comments are greatly appreciated.

    EDIT1: Nope, the code is not finished yet. I need to determine the max value of x to put in the for loop of k.

    EDIT2: Code updated.
    Last edited by dukevn; December 10th, 2009 at 09:41 PM.

  5. #5
    Join Date
    Jan 2006
    Location
    Singapore
    Posts
    6,261

    Re: best way to deal with big array or vector?

    Quote Originally Posted by dukevn
    I am gonna test it with a very big input file to see how it performances, but any recommendation / suggestion to improve the code to work with big file is welcome. Also, I need to see some performance's reports (CPU, usage, memory etc...), so any comments are greatly appreciated.
    I find it easier to write that input processing loop in this way:
    Code:
    while (getline(fin, tempLine, '\n')) {
      stringstream ss(tempLine);
      string temp;
      int x1, x2;
      if (ss >> temp >> x1 >> x2) {
        // Process the data.
        // ...
      } else {
        // Handle input format error.
      }
    }
    I do not understand why you use strings as keys for the map when it seems like you want to use integers as keys.
    C + C++ Compiler: MinGW port of GCC
    Build + Version Control System: SCons + Bazaar

    Look up a C/C++ Reference and learn How To Ask Questions The Smart Way
    Kindly rate my posts if you found them useful

  6. #6
    Join Date
    Dec 2009
    Posts
    89

    Re: best way to deal with big array or vector?

    Quote Originally Posted by laserlight View Post
    I do not understand why you use strings as keys for the map when it seems like you want to use integers as keys.
    I defined map<int, int> but I got error when compiling, and did not understand why, so I googled a little bit, saw people use string, char etc. as well, so I tried with string. Using int seems to shorten the code, but I need to look back why I had error.
    Quote Originally Posted by laserlight View Post
    I find it easier to write that input processing loop in this way:
    Code:
    while (getline(fin, tempLine, '\n')) {
      stringstream ss(tempLine);
      string temp;
      int x1, x2;
      if (ss >> temp >> x1 >> x2) {
        // Process the data.
        // ...
      } else {
        // Handle input format error.
      }
    }
    I will try your suggestion and report back. Thanks.

  7. #7
    Join Date
    Dec 2009
    Posts
    89

    Re: best way to deal with big array or vector?

    New version with laserlight and Lindley suggestions:
    Code:
    #include <map>
    #include <vector>
    #include <stdio.h>
    #include <stdlib.h>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <time.h>
    
    using namespace std;
    
    int main( int argc, char *argv[] ) {
      time_t start,end;
      double diff;
      time (&start);
      map<int, int> mapTest;
      ifstream fin(argv[1]);
      string tempLine;
      int XMax;
      while ( getline(fin, tempLine, '\n') ) {
        stringstream ss(tempLine);
        string temp;
        int x1, x2;
        if ( ss >> temp >> x1 >> x2 ) {
          for (int i=0;i<=(x2-x1);i++) {
            mapTest[x1+i]++;
          }
          XMax = x2;
        }
      }
      fin.close();
      ofstream fout(argv[2]);
      for ( int k=0;k<=XMax;k++ ) {
        if ( mapTest[k]!=0 ) {
          fout << k << "\t" << mapTest[k] << endl;
        }
      }
      fout.close();
      time (&end);
      diff = difftime(end, start);
      printf("Running time: %.2lf\n",diff);
      return 0;
    }
    For an input file of 84 MB, the new version is around 454s which is twice as fast as old version (924s) on a Macbook Pro Core 2 Duo 2.2 GHz and 4 GB of memory. Good news is that there is no memory error, but I still hope that I can speed it up (if possible).

    Any idea how I can measure memory needed to run the code?

  8. #8
    Join Date
    Jan 2006
    Location
    Singapore
    Posts
    6,261

    Re: best way to deal with big array or vector?

    It will not make much of a difference in efficiency, but for readability this:
    Code:
    for (int i=0;i<=(x2-x1);i++) {
      mapTest[x1+i]++;
    }
    should be:
    Code:
    for (int i = x1; i <= x2; ++i) {
      ++mapTest[i];
    }
    This is inefficient:
    Code:
    if ( mapTest[k]!=0 ) {
      fout << k << "\t" << mapTest[k] << endl;
    }
    I would expect:
    Code:
    map<int, int>::iterator found = mapTest.find(k);
    if (found != mapTest.end()) {
      fout << k << "\t" << found->second << endl;
    }
    However, why bother with searching? You might as well just iterate over the map:
    Code:
    for (map<int, int>::iterator i = mapTest.begin(), end = mapTest.end(); i != end; ++i) {
      fout << i->first << '\t' << i->second << '\n';
    }
    fout.flush();
    This also renders your XMax variable redundant, so it can be removed along with the code to set it (which was not correct anyway, in my opinion, since it does not actually record the max).
    C + C++ Compiler: MinGW port of GCC
    Build + Version Control System: SCons + Bazaar

    Look up a C/C++ Reference and learn How To Ask Questions The Smart Way
    Kindly rate my posts if you found them useful

  9. #9
    Join Date
    Aug 2000
    Location
    West Virginia
    Posts
    7,564

    Re: best way to deal with big array or vector?

    Besides the above ...

    If there is nothing else on the input lines, you can get rid of the stringstream
    (which tends to be slow) ...

    Code:
      int x1, x2;
    
      while ( fin >> tempLine >> x1 >> x2 ) 
      {
          for (int i=0;i<=(x2-x1);i++) 
          {
           ++mapTest[x1+i];
          }
      }

  10. #10
    Join Date
    Dec 2009
    Posts
    89

    Re: best way to deal with big array or vector?

    Quote Originally Posted by laserlight View Post
    It will not make much of a difference in efficiency, but for readability this:
    Code:
    for (int i=0;i<=(x2-x1);i++) {
      mapTest[x1+i]++;
    }
    should be:
    Code:
    for (int i = x1; i <= x2; ++i) {
      ++mapTest[i];
    }
    Honestly I dont understand ++x vs x++. What is the difference here?
    Quote Originally Posted by laserlight View Post
    This is inefficient:
    Code:
    if ( mapTest[k]!=0 ) {
      fout << k << "\t" << mapTest[k] << endl;
    }
    I would expect:
    Code:
    map<int, int>::iterator found = mapTest.find(k);
    if (found != mapTest.end()) {
      fout << k << "\t" << found->second << endl;
    }
    However, why bother with searching? You might as well just iterate over the map:
    Code:
    for (map<int, int>::iterator i = mapTest.begin(), end = mapTest.end(); i != end; ++i) {
      fout << i->first << '\t' << i->second << '\n';
    }
    fout.flush();
    This also renders your XMax variable redundant, so it can be removed along with the code to set it (which was not correct anyway, in my opinion, since it does not actually record the max).
    I will try your suggestions. But why XMax is not correct? In my code, each line when being read, the second coordinate will be assigned as xmax, so will be the last-not-empty line, and that value should be the maximum of x. What am I missing here?

  11. #11
    Join Date
    Dec 2009
    Posts
    89

    Re: best way to deal with big array or vector?

    Quote Originally Posted by Philip Nicoletti View Post
    Besides the above ...

    If there is nothing else on the input lines, you can get rid of the stringstream
    (which tends to be slow) ...

    Code:
      int x1, x2;
    
      while ( fin >> tempLine >> x1 >> x2 ) 
      {
          for (int i=0;i<=(x2-x1);i++) 
          {
           ++mapTest[x1+i];
          }
      }
    In the real input file, there are other things after the third column. Not sure about your suggestion, but I will try. Thanks.

  12. #12
    Join Date
    Jan 2006
    Location
    Singapore
    Posts
    6,261

    Re: best way to deal with big array or vector?

    Quote Originally Posted by dukevn
    Honestly I dont understand ++x vs x++. What is the difference here?
    In this case, because it is applied to a built-in type, there should be no difference in efficiency when used in standalone expressions. However, for class types a typical implementation of pre-increment is no less efficient than post-increment, and could even be slightly more efficient.

    Quote Originally Posted by dukevn
    But why XMax is not correct? In my code, each line when being read, the second coordinate will be assigned as xmax, so will be the last-not-empty line, and that value should be the maximum of x. What am I missing here?
    Consider this input:
    Code:
    a1	65	70
    a1	12	17
    The max would then be recorded as 17, not 70.
    C + C++ Compiler: MinGW port of GCC
    Build + Version Control System: SCons + Bazaar

    Look up a C/C++ Reference and learn How To Ask Questions The Smart Way
    Kindly rate my posts if you found them useful

  13. #13
    Join Date
    Dec 2009
    Posts
    89

    Re: best way to deal with big array or vector?

    Quote Originally Posted by laserlight View Post
    In this case, because it is applied to a built-in type, there should be no difference in efficiency when used in standalone expressions. However, for class types a typical implementation of pre-increment is no less efficient than post-increment, and could even be slightly more efficient.
    I see, thanks.
    Quote Originally Posted by laserlight View Post
    Consider this input:
    Code:
    a1	65	70
    a1	12	17
    The max would then be recorded as 17, not 70.
    Ah, yeah. In the original file, you would not see such inputs, since it is pre-sorted by the second column (and I forgot to mention that, my mistake ). Anyway, you pointed out is great, cause then now the code will work with unsorted file as well.

    By the way, what is difference between fout.flush(); and fout.close();?

  14. #14
    Join Date
    Jan 2006
    Location
    Singapore
    Posts
    6,261

    Re: best way to deal with big array or vector?

    Quote Originally Posted by dukevn
    By the way, what is difference between fout.flush(); and fout.close();?
    One forces output to be written, the other forces output to be written and then closes the output stream.
    C + C++ Compiler: MinGW port of GCC
    Build + Version Control System: SCons + Bazaar

    Look up a C/C++ Reference and learn How To Ask Questions The Smart Way
    Kindly rate my posts if you found them useful

  15. #15
    Join Date
    Dec 2009
    Posts
    89

    Re: best way to deal with big array or vector?

    Quote Originally Posted by laserlight View Post
    One forces output to be written, the other forces output to be written and then closes the output stream.
    Afterward I still have to close the file, or I dont? If using flush only, then fout is still not closed when the code finishes?

Page 1 of 4 1234 LastLast

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  


Azure Activities Information Page

Windows Mobile Development Center


Click Here to Expand Forum to Full Width

This is a CodeGuru survey question.


Featured


HTML5 Development Center