CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com

Thread: best way to deal with big array or vector?

1. Member
Join Date
Dec 2009
Posts
89

best way to deal with big array or vector?

Hi all,

I am trying c++ with some maths now, and I started with simple 2D vector (or array, or matrix, what ever you call it). What I want to do is: in the file below
Code:
```a1	12	17
a1	20	25
a1	21	26
a1	33	38
a1	35	40
a1	36	41
a1	51	56
a1	65	70
...```
each line is one "line", or "vector", with the second and third columns are x coordinates. The program read line by line, reads those two x-coordinates (say x1 and x2), and gives y at any point between those two xs additional one unit (so y(20) = y(21) = ... = y(25) = 1 for the first line). y of all points start at zero. You can see that y(21) = 2 since 21 appears in two intervals (two first lines), y(37) = 3 since 37 appears in three lines.

After reading the file, it should write to a new text file with the non-zero points only, point by line, like below:
Code:
```12	1
13	1
14	1
15	1
16	1
17	1
20	1
21	2
22	2
23	2
24	2
25	2
26	1
33	1
34	1
35	2
36	3
37	3
38	3
39	2
40	2
41	1
51	1
52	1
...```
Quite simple if the input file is small. The algorithm I came up with is:

- define array y from 0 to max of x, initial values are all zero.
- read the file, if x is from x1 to x2, then y(x)++;
- after done, write to file: from 0 -> x max, if y(x) is not zero, then write it down.

Problem is if x max is big, then the defined array would be big, and that is a waste of memory as well since a lot of y array is zero. Also, if x max is very big (millions to billions) then the algorithm would not work.

2. Elite Member Power Poster
Join Date
Oct 2007
Location
Seattle, WA
Posts
10,895

Re: best way to deal with big array or vector?

When your input is very sparse or over a potentially large range, consider using a std::map instead of an array or std::vector. In this case, defining y as a std::map<int,int> rather than an array should provide the desired behavior.

3. Member
Join Date
Dec 2009
Posts
89

Re: best way to deal with big array or vector?

Originally Posted by Lindley
When your input is very sparse or over a potentially large range, consider using a std::map instead of an array or std::vector. In this case, defining y as a std::map<int,int> rather than an array should provide the desired behavior.
std::map is new to me. Even the C++ book I have (C++ Primer Plus) does not have it. Would you mind giving me some examples and how to use it?

4. Member
Join Date
Dec 2009
Posts
89

Re: best way to deal with big array or vector?

Originally Posted by dukevn
std::map is new to me. Even the C++ book I have (C++ Primer Plus) does not have it. Would you mind giving me some examples and how to use it?
Nevermind. I got it. Below is the code I came up with:
Code:
```#include <map>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main() {
map<string, int> mapTest;
ifstream fin("in.txt");
string tempLine;
int XMax;
while ( getline(fin, tempLine, '\n') ) {
vector <string> tempString;
stringstream ss(tempLine);
string tempTab;
while ( getline(ss, tempTab, '\t') ) {
tempString.push_back(tempTab);
}
int x1 = atoi(tempString[1].c_str());
int x2 = atoi(tempString[2].c_str());
for (int i=0;i<=(x2-x1);i++) {
stringstream tempX;
tempX << x1 + i;
mapTest[tempX.str()]++;
}
XMax = x2;
}
fin.close();
ofstream fout("out.txt");
for ( int k=0;k<=XMax;k++ ) {
stringstream tempK;
tempK << k;
if ( mapTest[tempK.str()]!=0 ) {
fout << tempK.str() << "\t" << mapTest[tempK.str()] << endl;
}
}
fout.close();
return 0;
}```
I am gonna test it with a very big input file to see how it performances, but any recommendation / suggestion to improve the code to work with big file is welcome. Also, I need to see some performance's reports (CPU, usage, memory etc...), so any comments are greatly appreciated.

EDIT1: Nope, the code is not finished yet. I need to determine the max value of x to put in the for loop of k.

EDIT2: Code updated.
Last edited by dukevn; December 10th, 2009 at 09:41 PM.

5. Elite Member Power Poster
Join Date
Jan 2006
Location
Singapore
Posts
6,748

Re: best way to deal with big array or vector?

Originally Posted by dukevn
I am gonna test it with a very big input file to see how it performances, but any recommendation / suggestion to improve the code to work with big file is welcome. Also, I need to see some performance's reports (CPU, usage, memory etc...), so any comments are greatly appreciated.
I find it easier to write that input processing loop in this way:
Code:
```while (getline(fin, tempLine, '\n')) {
stringstream ss(tempLine);
string temp;
int x1, x2;
if (ss >> temp >> x1 >> x2) {
// Process the data.
// ...
} else {
// Handle input format error.
}
}```
I do not understand why you use strings as keys for the map when it seems like you want to use integers as keys.

6. Member
Join Date
Dec 2009
Posts
89

Re: best way to deal with big array or vector?

Originally Posted by laserlight
I do not understand why you use strings as keys for the map when it seems like you want to use integers as keys.
I defined map<int, int> but I got error when compiling, and did not understand why, so I googled a little bit, saw people use string, char etc. as well, so I tried with string. Using int seems to shorten the code, but I need to look back why I had error.
Originally Posted by laserlight
I find it easier to write that input processing loop in this way:
Code:
```while (getline(fin, tempLine, '\n')) {
stringstream ss(tempLine);
string temp;
int x1, x2;
if (ss >> temp >> x1 >> x2) {
// Process the data.
// ...
} else {
// Handle input format error.
}
}```
I will try your suggestion and report back. Thanks.

7. Member
Join Date
Dec 2009
Posts
89

Re: best way to deal with big array or vector?

New version with laserlight and Lindley suggestions:
Code:
```#include <map>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <time.h>

using namespace std;

int main( int argc, char *argv[] ) {
time_t start,end;
double diff;
time (&start);
map<int, int> mapTest;
ifstream fin(argv[1]);
string tempLine;
int XMax;
while ( getline(fin, tempLine, '\n') ) {
stringstream ss(tempLine);
string temp;
int x1, x2;
if ( ss >> temp >> x1 >> x2 ) {
for (int i=0;i<=(x2-x1);i++) {
mapTest[x1+i]++;
}
XMax = x2;
}
}
fin.close();
ofstream fout(argv[2]);
for ( int k=0;k<=XMax;k++ ) {
if ( mapTest[k]!=0 ) {
fout << k << "\t" << mapTest[k] << endl;
}
}
fout.close();
time (&end);
diff = difftime(end, start);
printf("Running time: %.2lf\n",diff);
return 0;
}```
For an input file of 84 MB, the new version is around 454s which is twice as fast as old version (924s) on a Macbook Pro Core 2 Duo 2.2 GHz and 4 GB of memory. Good news is that there is no memory error, but I still hope that I can speed it up (if possible).

Any idea how I can measure memory needed to run the code?

8. Elite Member Power Poster
Join Date
Jan 2006
Location
Singapore
Posts
6,748

Re: best way to deal with big array or vector?

It will not make much of a difference in efficiency, but for readability this:
Code:
```for (int i=0;i<=(x2-x1);i++) {
mapTest[x1+i]++;
}```
should be:
Code:
```for (int i = x1; i <= x2; ++i) {
++mapTest[i];
}```
This is inefficient:
Code:
```if ( mapTest[k]!=0 ) {
fout << k << "\t" << mapTest[k] << endl;
}```
I would expect:
Code:
```map<int, int>::iterator found = mapTest.find(k);
if (found != mapTest.end()) {
fout << k << "\t" << found->second << endl;
}```
However, why bother with searching? You might as well just iterate over the map:
Code:
```for (map<int, int>::iterator i = mapTest.begin(), end = mapTest.end(); i != end; ++i) {
fout << i->first << '\t' << i->second << '\n';
}
fout.flush();```
This also renders your XMax variable redundant, so it can be removed along with the code to set it (which was not correct anyway, in my opinion, since it does not actually record the max).

9. Elite Member Power Poster
Join Date
Aug 2000
Location
West Virginia
Posts
7,712

Re: best way to deal with big array or vector?

Besides the above ...

If there is nothing else on the input lines, you can get rid of the stringstream
(which tends to be slow) ...

Code:
```  int x1, x2;

while ( fin >> tempLine >> x1 >> x2 )
{
for (int i=0;i<=(x2-x1);i++)
{
++mapTest[x1+i];
}
}```

10. Member
Join Date
Dec 2009
Posts
89

Re: best way to deal with big array or vector?

Originally Posted by laserlight
It will not make much of a difference in efficiency, but for readability this:
Code:
```for (int i=0;i<=(x2-x1);i++) {
mapTest[x1+i]++;
}```
should be:
Code:
```for (int i = x1; i <= x2; ++i) {
++mapTest[i];
}```
Honestly I dont understand ++x vs x++. What is the difference here?
Originally Posted by laserlight
This is inefficient:
Code:
```if ( mapTest[k]!=0 ) {
fout << k << "\t" << mapTest[k] << endl;
}```
I would expect:
Code:
```map<int, int>::iterator found = mapTest.find(k);
if (found != mapTest.end()) {
fout << k << "\t" << found->second << endl;
}```
However, why bother with searching? You might as well just iterate over the map:
Code:
```for (map<int, int>::iterator i = mapTest.begin(), end = mapTest.end(); i != end; ++i) {
fout << i->first << '\t' << i->second << '\n';
}
fout.flush();```
This also renders your XMax variable redundant, so it can be removed along with the code to set it (which was not correct anyway, in my opinion, since it does not actually record the max).
I will try your suggestions. But why XMax is not correct? In my code, each line when being read, the second coordinate will be assigned as xmax, so will be the last-not-empty line, and that value should be the maximum of x. What am I missing here?

11. Member
Join Date
Dec 2009
Posts
89

Re: best way to deal with big array or vector?

Originally Posted by Philip Nicoletti
Besides the above ...

If there is nothing else on the input lines, you can get rid of the stringstream
(which tends to be slow) ...

Code:
```  int x1, x2;

while ( fin >> tempLine >> x1 >> x2 )
{
for (int i=0;i<=(x2-x1);i++)
{
++mapTest[x1+i];
}
}```
In the real input file, there are other things after the third column. Not sure about your suggestion, but I will try. Thanks.

12. Elite Member Power Poster
Join Date
Jan 2006
Location
Singapore
Posts
6,748

Re: best way to deal with big array or vector?

Originally Posted by dukevn
Honestly I dont understand ++x vs x++. What is the difference here?
In this case, because it is applied to a built-in type, there should be no difference in efficiency when used in standalone expressions. However, for class types a typical implementation of pre-increment is no less efficient than post-increment, and could even be slightly more efficient.

Originally Posted by dukevn
But why XMax is not correct? In my code, each line when being read, the second coordinate will be assigned as xmax, so will be the last-not-empty line, and that value should be the maximum of x. What am I missing here?
Consider this input:
Code:
```a1	65	70
a1	12	17```
The max would then be recorded as 17, not 70.

13. Member
Join Date
Dec 2009
Posts
89

Re: best way to deal with big array or vector?

Originally Posted by laserlight
In this case, because it is applied to a built-in type, there should be no difference in efficiency when used in standalone expressions. However, for class types a typical implementation of pre-increment is no less efficient than post-increment, and could even be slightly more efficient.
I see, thanks.
Originally Posted by laserlight
Consider this input:
Code:
```a1	65	70
a1	12	17```
The max would then be recorded as 17, not 70.
Ah, yeah. In the original file, you would not see such inputs, since it is pre-sorted by the second column (and I forgot to mention that, my mistake ). Anyway, you pointed out is great, cause then now the code will work with unsorted file as well.

By the way, what is difference between fout.flush(); and fout.close();?

14. Elite Member Power Poster
Join Date
Jan 2006
Location
Singapore
Posts
6,748

Re: best way to deal with big array or vector?

Originally Posted by dukevn
By the way, what is difference between fout.flush(); and fout.close();?
One forces output to be written, the other forces output to be written and then closes the output stream.

15. Member
Join Date
Dec 2009
Posts
89

Re: best way to deal with big array or vector?

Originally Posted by laserlight
One forces output to be written, the other forces output to be written and then closes the output stream.
Afterward I still have to close the file, or I dont? If using flush only, then fout is still not closed when the code finishes?

Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•