best way to deal with big array or vector?

Printable View

Show 50 post(s) from this thread on one page

December 10th, 2009, 06:22 PM
dukevn

best way to deal with big array or vector?

Hi all,

I am trying c++ with some maths now, and I started with simple 2D vector (or array, or matrix, what ever you call it). What I want to do is: in the file below

Code:

a1 12 17 a1 20 25 a1 21 26 a1 33 38 a1 35 40 a1 36 41 a1 51 56 a1 65 70 ...

each line is one "line", or "vector", with the second and third columns are x coordinates. The program read line by line, reads those two x-coordinates (say x1 and x2), and gives y at any point between those two xs additional one unit (so y(20) = y(21) = ... = y(25) = 1 for the first line). y of all points start at zero. You can see that y(21) = 2 since 21 appears in two intervals (two first lines), y(37) = 3 since 37 appears in three lines.

After reading the file, it should write to a new text file with the non-zero points only, point by line, like below:

Code:

12 1 13 1 14 1 15 1 16 1 17 1 20 1 21 2 22 2 23 2 24 2 25 2 26 1 33 1 34 1 35 2 36 3 37 3 38 3 39 2 40 2 41 1 51 1 52 1 ...

Quite simple if the input file is small. The algorithm I came up with is:

- define array y from 0 to max of x, initial values are all zero.
- read the file, if x is from x1 to x2, then y(x)++;
- after done, write to file: from 0 -> x max, if y(x) is not zero, then write it down.

Problem is if x max is big, then the defined array would be big, and that is a waste of memory as well since a lot of y array is zero. Also, if x max is very big (millions to billions) then the algorithm would not work.

Anybody please help!
December 10th, 2009, 06:44 PM
Lindley

Re: best way to deal with big array or vector?

When your input is very sparse or over a potentially large range, consider using a std::map instead of an array or std::vector. In this case, defining y as a std::map<int,int> rather than an array should provide the desired behavior.
December 10th, 2009, 09:28 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Lindley

When your input is very sparse or over a potentially large range, consider using a std::map instead of an array or std::vector. In this case, defining y as a std::map<int,int> rather than an array should provide the desired behavior.

std::map is new to me. Even the C++ book I have (C++ Primer Plus) does not have it. Would you mind giving me some examples and how to use it?
December 10th, 2009, 10:31 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

std::map is new to me. Even the C++ book I have (C++ Primer Plus) does not have it. Would you mind giving me some examples and how to use it?

Nevermind. I got it. Below is the code I came up with:

Code:

#include <map> #include <vector> #include <stdio.h> #include <stdlib.h> #include <fstream> #include <iostream> #include <sstream> #include <string> using namespace std; int main() { map<string, int> mapTest; ifstream fin("in.txt"); string tempLine; int XMax; while ( getline(fin, tempLine, '\n') ) { vector <string> tempString; stringstream ss(tempLine); string tempTab; while ( getline(ss, tempTab, '\t') ) { tempString.push_back(tempTab); } int x1 = atoi(tempString[1].c_str()); int x2 = atoi(tempString[2].c_str()); for (int i=0;i<=(x2-x1);i++) { stringstream tempX; tempX << x1 + i; mapTest[tempX.str()]++; } XMax = x2; } fin.close(); ofstream fout("out.txt"); for ( int k=0;k<=XMax;k++ ) { stringstream tempK; tempK << k; if ( mapTest[tempK.str()]!=0 ) { fout << tempK.str() << "\t" << mapTest[tempK.str()] << endl; } } fout.close(); return 0; }

I am gonna test it with a very big input file to see how it performances, but any recommendation / suggestion to improve the code to work with big file is welcome. Also, I need to see some performance's reports (CPU, usage, memory etc...), so any comments are greatly appreciated.

EDIT1: Nope, the code is not finished yet. I need to determine the max value of x to put in the for loop of k.

EDIT2: Code updated.
December 10th, 2009, 11:01 PM
laserlight

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

I am gonna test it with a very big input file to see how it performances, but any recommendation / suggestion to improve the code to work with big file is welcome. Also, I need to see some performance's reports (CPU, usage, memory etc...), so any comments are greatly appreciated.

I find it easier to write that input processing loop in this way:

Code:

while (getline(fin, tempLine, '\n')) { stringstream ss(tempLine); string temp; int x1, x2; if (ss >> temp >> x1 >> x2) { // Process the data. // ... } else { // Handle input format error. } }

I do not understand why you use strings as keys for the map when it seems like you want to use integers as keys.
December 10th, 2009, 11:18 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by laserlight

I do not understand why you use strings as keys for the map when it seems like you want to use integers as keys.

I defined map<int, int> but I got error when compiling, and did not understand why, so I googled a little bit, saw people use string, char etc. as well, so I tried with string. Using int seems to shorten the code, but I need to look back why I had error.

Quote:

Originally Posted by laserlight

I find it easier to write that input processing loop in this way:

Code:

while (getline(fin, tempLine, '\n')) { stringstream ss(tempLine); string temp; int x1, x2; if (ss >> temp >> x1 >> x2) { // Process the data. // ... } else { // Handle input format error. } }

I will try your suggestion and report back. Thanks.
December 11th, 2009, 01:54 AM
dukevn

Re: best way to deal with big array or vector?

New version with laserlight and Lindley suggestions:

Code:

#include <map> #include <vector> #include <stdio.h> #include <stdlib.h> #include <fstream> #include <iostream> #include <sstream> #include <string> #include <time.h> using namespace std; int main( int argc, char *argv[] ) { time_t start,end; double diff; time (&start); map<int, int> mapTest; ifstream fin(argv[1]); string tempLine; int XMax; while ( getline(fin, tempLine, '\n') ) { stringstream ss(tempLine); string temp; int x1, x2; if ( ss >> temp >> x1 >> x2 ) { for (int i=0;i<=(x2-x1);i++) { mapTest[x1+i]++; } XMax = x2; } } fin.close(); ofstream fout(argv[2]); for ( int k=0;k<=XMax;k++ ) { if ( mapTest[k]!=0 ) { fout << k << "\t" << mapTest[k] << endl; } } fout.close(); time (&end); diff = difftime(end, start); printf("Running time: %.2lf\n",diff); return 0; }

For an input file of 84 MB, the new version is around 454s which is twice as fast as old version (924s) on a Macbook Pro Core 2 Duo 2.2 GHz and 4 GB of memory. Good news is that there is no memory error, but I still hope that I can speed it up (if possible).

Any idea how I can measure memory needed to run the code?
December 11th, 2009, 02:20 AM
laserlight

Re: best way to deal with big array or vector?

It will not make much of a difference in efficiency, but for readability this:

Code:

for (int i=0;i<=(x2-x1);i++) { mapTest[x1+i]++; }

should be:

Code:

for (int i = x1; i <= x2; ++i) { ++mapTest[i]; }

This is inefficient:

Code:

if ( mapTest[k]!=0 ) { fout << k << "\t" << mapTest[k] << endl; }

I would expect:

Code:

map<int, int>::iterator found = mapTest.find(k); if (found != mapTest.end()) { fout << k << "\t" << found->second << endl; }

However, why bother with searching? You might as well just iterate over the map:

Code:

for (map<int, int>::iterator i = mapTest.begin(), end = mapTest.end(); i != end; ++i) { fout << i->first << '\t' << i->second << '\n'; } fout.flush();

This also renders your XMax variable redundant, so it can be removed along with the code to set it (which was not correct anyway, in my opinion, since it does not actually record the max).
December 11th, 2009, 08:49 AM
Philip Nicoletti

Re: best way to deal with big array or vector?

Besides the above ...

If there is nothing else on the input lines, you can get rid of the stringstream
(which tends to be slow) ...

Code:

int x1, x2; while ( fin >> tempLine >> x1 >> x2 ) { for (int i=0;i<=(x2-x1);i++) { ++mapTest[x1+i]; } }
December 11th, 2009, 10:39 AM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by laserlight

It will not make much of a difference in efficiency, but for readability this:

Code:

for (int i=0;i<=(x2-x1);i++) { mapTest[x1+i]++; }

should be:

Code:

for (int i = x1; i <= x2; ++i) { ++mapTest[i]; }

Honestly I dont understand ++x vs x++. What is the difference here?

Quote:

Originally Posted by laserlight

This is inefficient:

Code:

if ( mapTest[k]!=0 ) { fout << k << "\t" << mapTest[k] << endl; }

I would expect:

Code:

map<int, int>::iterator found = mapTest.find(k); if (found != mapTest.end()) { fout << k << "\t" << found->second << endl; }

However, why bother with searching? You might as well just iterate over the map:

Code:

for (map<int, int>::iterator i = mapTest.begin(), end = mapTest.end(); i != end; ++i) { fout << i->first << '\t' << i->second << '\n'; } fout.flush();

This also renders your XMax variable redundant, so it can be removed along with the code to set it (which was not correct anyway, in my opinion, since it does not actually record the max).

I will try your suggestions. But why XMax is not correct? In my code, each line when being read, the second coordinate will be assigned as xmax, so will be the last-not-empty line, and that value should be the maximum of x. What am I missing here?
December 11th, 2009, 10:41 AM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Philip Nicoletti

Besides the above ...

If there is nothing else on the input lines, you can get rid of the stringstream
(which tends to be slow) ...

Code:

int x1, x2; while ( fin >> tempLine >> x1 >> x2 ) { for (int i=0;i<=(x2-x1);i++) { ++mapTest[x1+i]; } }

In the real input file, there are other things after the third column. Not sure about your suggestion, but I will try. Thanks.
December 11th, 2009, 10:48 AM
laserlight

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

Honestly I dont understand ++x vs x++. What is the difference here?

In this case, because it is applied to a built-in type, there should be no difference in efficiency when used in standalone expressions. However, for class types a typical implementation of pre-increment is no less efficient than post-increment, and could even be slightly more efficient.

Quote:

Originally Posted by dukevn

But why XMax is not correct? In my code, each line when being read, the second coordinate will be assigned as xmax, so will be the last-not-empty line, and that value should be the maximum of x. What am I missing here?

Consider this input:

Code:

a1 65 70 a1 12 17

The max would then be recorded as 17, not 70.
December 11th, 2009, 10:56 AM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by laserlight

In this case, because it is applied to a built-in type, there should be no difference in efficiency when used in standalone expressions. However, for class types a typical implementation of pre-increment is no less efficient than post-increment, and could even be slightly more efficient.

I see, thanks.

Quote:

Originally Posted by laserlight

Consider this input:

Code:

a1 65 70 a1 12 17

The max would then be recorded as 17, not 70.

Ah, yeah. In the original file, you would not see such inputs, since it is pre-sorted by the second column (and I forgot to mention that, my mistake :)). Anyway, you pointed out is great, cause then now the code will work with unsorted file as well.

By the way, what is difference between fout.flush(); and fout.close();?
December 11th, 2009, 10:59 AM
laserlight

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

By the way, what is difference between fout.flush(); and fout.close();?

One forces output to be written, the other forces output to be written and then closes the output stream.
December 11th, 2009, 11:40 AM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by laserlight

One forces output to be written, the other forces output to be written and then closes the output stream.

Afterward I still have to close the file, or I dont? If using flush only, then fout is still not closed when the code finishes?
December 11th, 2009, 11:42 AM
Lindley

Re: best way to deal with big array or vector?

All streams are closed when their object is destroyed. Typically this happens when they go out of scope.
December 11th, 2009, 11:51 AM
laserlight

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

Afterward I still have to close the file, or I dont?

It will be closed when the stream object is destroyed when it goes out of scope. Of course, if the end of the scope is not near, it is good to explicitly close it anyway.
December 11th, 2009, 12:13 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Lindley

All streams are closed when their object is destroyed. Typically this happens when they go out of scope.

Quote:

Originally Posted by laserlight

It will be closed when the stream object is destroyed when it goes out of scope. Of course, if the end of the scope is not near, it is good to explicitly close it anyway.

So it is fine if I use flush, but it is good (and concise) if I use close, right? Is flush more efficient than close?
December 11th, 2009, 12:19 PM
laserlight

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

So it is fine if I use flush, but it is good (and concise) if I use close, right? Is flush more efficient than close?

It is up to use. I used flush() in my example because that is what std::endl does in addition to writing a newline, and my intention was to demonstrate that there is no need to flush the stream on each iteration. You just need to flush once at the end, and you may not even need to do that explicitly.
December 11th, 2009, 12:59 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by laserlight

It is up to use. I used flush() in my example because that is what std::endl does in addition to writing a newline, and my intention was to demonstrate that there is no need to flush the stream on each iteration. You just need to flush once at the end, and you may not even need to do that explicitly.

I did not know of flush() before your example :), and I did not know that I have to (or should?) flush any stream? But I do remember that I did not have error or warning if I forgot closing a file.
December 11th, 2009, 01:01 PM
Philip Nicoletti

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

In the real input file, there are other things after the third column. Not sure about your suggestion, but I will try. Thanks.

In that case, you would need to do the following (will work if there are other
things after the third column ... or if there are only 3 columns):

Code:

while ( fin >> tempLine >> x1 >> x2 ) { getline(fin,tempLine); for (int i=0;i<=(x2-x1);i++) { ++mapTest[x1+i]; } }
December 11th, 2009, 01:07 PM
laserlight

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Philip Nicoletti

In that case, you would need to do the following

Good point. It may be more explanatory to use the ignore() member function though.
December 11th, 2009, 01:22 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Philip Nicoletti

In that case, you would need to do the following (will work if there are other
things after the third column ... or if there are only 3 columns):

Code:

while ( fin >> tempLine >> x1 >> x2 ) { getline(fin,tempLine); for (int i=0;i<=(x2-x1);i++) { ++mapTest[x1+i]; } }

You got it just right Philip. I am wondering why

Code:

while ( fin >> tempLine >> x1 >> x2 ) { string temp; int x1, x2; if ( fin >> temp >> x1 >> x2 ) { for ( int i = x1; i <= x2; ++i ) { ++mapTest[i]; } } }

neglects the first input line, then you shed a light for me :). Testing them now, and I will report back the results.
December 11th, 2009, 01:23 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by laserlight

It may be more explanatory to use the ignore() member function though.

Would you mind giving me more explanation? How do I use ignore()?
December 11th, 2009, 01:46 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

Testing them now, and I will report back the results.

OK here is the reports with an input file of 1.06GB on a cluster node of 8 cores:

- Original code: 36m2.169s
- Improved v.1 code: 12m40.918s
- Improved v.2 (without XMax): 11m39.172s
- Final code v.3 (without string stream): 11m35.974s

So there is no much difference between the last three versions (but three times as fast as the original one - a great improvement). One thing I am aiming now is how to make use of the multi-core advantage (right now the code runs only on one core), but it seems to be not that easy.

Thanks for all of your helps.
December 11th, 2009, 01:49 PM
laserlight

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

Would you mind giving me more explanation? How do I use ignore()?

Suppose according to the input format there will be a tab character between the first field and the second field. You could dispense with the temporary string variable that was used to ignore input:

Code:

int x1, x2; while (fin.ignore(1000, '\t') && (fin >> x1 >> x2)) { fin.ignore(1000, '\n'); for (int i = x1; i <= x2; ++i) { ++mapTest[i]; } }

where 1000 is arbitrarily chosen. You could have used std::numeric_limits<std::streamsize>::max() instead.
December 11th, 2009, 01:49 PM
Lindley

Re: best way to deal with big array or vector?

flush() is something you usually don't need to call yourself. It usually gets handled automatically. But it's available because occasionally you do need to call it explicitly.

Multiple cores are not going to help much for file parsing. Everything bottlenecks through the disk controller anyway. Typically, the biggest multi-core gain comes when you're doing heavy mathematical computations in main memory.
December 11th, 2009, 03:56 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by laserlight

Suppose according to the input format there will be a tab character between the first field and the second field. You could dispense with the temporary string variable that was used to ignore input:

Code:

int x1, x2; while (fin.ignore(1000, '\t') && (fin >> x1 >> x2)) { fin.ignore(1000, '\n'); for (int i = x1; i <= x2; ++i) { ++mapTest[i]; } }

where 1000 is arbitrarily chosen. You could have used std::numeric_limits<std::streamsize>::max() instead.

Got it. Thanks laserlight.
December 11th, 2009, 03:57 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Lindley

Multiple cores are not going to help much for file parsing. Everything bottlenecks through the disk controller anyway. Typically, the biggest multi-core gain comes when you're doing heavy mathematical computations in main memory.

Are you saying that splitting input file to 8 chunks, processing those 8 chunks in parallel will not help at all?
December 11th, 2009, 05:17 PM
VladimirF

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

OK here is the reports with an input file of 1.06GB on a cluster node of 8 cores:

- Original code: 36m2.169s
- Improved v.1 code: 12m40.918s
- Improved v.2 (without XMax): 11m39.172s
- Final code v.3 (without string stream): 11m35.974s

So there is no much difference between the last three versions (but three times as fast as the original one - a great improvement).

My guess is that code like that should execute at the speed of file I/O.
I know I can copy a 1GB file in about 30 seconds, so 11 minutes sounds like WAY too much.
Could you comment out everything in your code except for I/O and see how long that takes?
Do you mind posting your code and a sample data file?
December 11th, 2009, 06:17 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by VladimirF

My guess is that code like that should execute at the speed of file I/O.
I know I can copy a 1GB file in about 30 seconds, so 11 minutes sounds like WAY too much.
Could you comment out everything in your code except for I/O and see how long that takes?
Do you mind posting your code and a sample data file?

Here is the final code

Code:

#include <map> #include <vector> #include <stdio.h> #include <stdlib.h> #include <fstream> #include <iostream> #include <sstream> #include <string> #include <time.h> using namespace std; int main( int argc, char *argv[] ) { if ( argc == 1 ) { fprintf(stderr, "Usage: xytest <input> <output>\n"); return 1; } time_t start,end; double diff; time (&start); map<int, int> mapTest; ifstream fin(argv[1]); string tempLine; int x1, x2; while ( fin >> tempLine >> x1 >> x2 ) { getline(fin,tempLine); for ( int i = x1; i <= x2; ++i ) { ++mapTest[i]; } } fin.close(); ofstream fout(argv[2]); for ( map<int, int>::iterator i = mapTest.begin(), end = mapTest.end(); i != end; ++i ) { fout << i->first << '\t' << i->second << '\n'; } fout.close(); time (&end); diff = difftime(end, start); printf("Running time: %.2lf\n",diff); return 0; }

Some lines in the input file is like this

Code:

a1 35952485 35952517 2 0 - a1 66611781 66611813 15 0 + a1 154104890 154104922 10 0 + a1 224631511 224631543 6 0 -

and there are 25501054 lines similar like that in the file. The size of file is 1141987065.

Let me try your suggestion to see what results are.
December 11th, 2009, 07:42 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by VladimirF

My guess is that code like that should execute at the speed of file I/O.

Looks like the ++mapTest[i] takes a lot of time. I tested with replacing ++mapTest[i] with, for example, ++k, and deleting the writing to file part, then the code needs only 45s to run. Get ++mapTest[i] back in, it needs 656s. The rest (~ a minute or more) is the time of writing to output.
December 11th, 2009, 08:07 PM
Paul McKenzie

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

Looks like the ++mapTest[i] takes a lot of time. I tested with replacing ++mapTest[i] with, for example, ++k, and deleting the writing to file part, then the code needs only 45s to run. Get ++mapTest[i] back in, it needs 656s. The rest (~ a minute or more) is the time of writing to output.

1) What compiler are you using? Version?

2) What compiler options?

3) Are you timing a debug build, or an optimized release build? If it's a debug build, those timing results are meaningless.

Regards,

Paul McKenzie
December 11th, 2009, 09:05 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Paul McKenzie

1) What compiler are you using? Version?

On Unix:

Code:

$ g++ --version g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)

On Mac Pro:

Code:

$ g++ --version i686-apple-darwin9-g++-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)

Quote:

Originally Posted by Paul McKenzie

2) What compiler options?

Code:

$ g++ xytest.c -o xytest

Quote:

Originally Posted by Paul McKenzie

3) Are you timing a debug build, or an optimized release build? If it's a debug build, those timing results are meaningless.

Not sure what you are asking here, but once I compiled and get executable file xytest, I just run it. The time reported here is the time measured inside the code.

What is debug build and what is optimized release build? How can my code be considered as debug, or release?
December 11th, 2009, 09:34 PM
Paul McKenzie

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

Not sure what you are asking here,

Well, what you don't understand is the whole ballgame.

When you compile your application, there are certain options that you must choose so that the compiler optimizes the code. If you don't do that, the compiler generates unoptimized code. Unoptimized code is used for debugging purposes (the compiler may be checking iterators and doing extra work).

The compiler options you have used do not turn on the optimizations. You need to rebuild your application using the proper optimization switches, and then retest your code.

Regards,

Paul McKenzie
December 11th, 2009, 09:47 PM
VladimirF

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

Looks like the ++mapTest[i] takes a lot of time.

Do you know much about input data? The range of values? Distribution? Tipical span of one line? Limit on the output counters? Will it fit in a byte?
I am trying to come up with some clever data structure. Like a map of short (4K?) arrays, where a key will be the low value shifted by a few bits to the right. That will provide a very fast access for your "for" loops, limiting a number of map look-ups to a number of lines in your file (or even less if you hit the same map key more than once).
The point is that you perform sequential access to your counters for each input line, and maps are not good for that.
December 11th, 2009, 10:43 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Paul McKenzie

Well, what you don't understand is the whole ballgame.

When you compile your application, there are certain options that you must choose so that the compiler optimizes the code. If you don't do that, the compiler generates unoptimized code. Unoptimized code is used for debugging purposes (the compiler may be checking iterators and doing extra work).

The compiler options you have used do not turn on the optimizations. You need to rebuild your application using the proper optimization switches, and then retest your code.

Regards,

Paul McKenzie

You got it right Paul. I've just started to learn C++ very recently. I do not know much about even C++ syntax, and I am trying to learn as much as I can. Honestly I do hear about coding optimization by choosing the right algorithms etc but never heard of optimization by compiling option. g++ input -o output is the only compiling command I know now (and option -lz if using zlib library).

What are those "certain options" that I should try? Could you point out for me some resources about those?
December 11th, 2009, 10:46 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by VladimirF

Do you know much about input data? The range of values? Distribution? Tipical span of one line? Limit on the output counters? Will it fit in a byte?

Well, just a little about input data (via some examples that I am learning and coding), and none of the rest.

Quote:

Originally Posted by VladimirF

I am trying to come up with some clever data structure. Like a map of short (4K?) arrays, where a key will be the low value shifted by a few bits to the right. That will provide a very fast access for your "for" loops, limiting a number of map look-ups to a number of lines in your file (or even less if you hit the same map key more than once).
The point is that you perform sequential access to your counters for each input line, and maps are not good for that.

I am happy to learn anything to feed up my tiny c++ background, so any hint/instruction/suggestion/recommendation are welcome.
December 11th, 2009, 10:57 PM
Paul McKenzie

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

Honestly I do hear about coding optimization by choosing the right algorithms etc but never heard of optimization by compiling option. g++ input -o output is the only compiling command I know now (and option -lz if using zlib library).

Basically, compilers, regardless of the language, are able to optimize the code that you've written (hence the term "optimizing compiler"). Things like moving invariant code out of loops so that they don't get executed over and over again is an example of an optimization that can be done by the compiler.

In your circumstance, not only will the compiler move code around to find the most efficient way to execute, but it may also eliminate "checking" code that checks for invalid iterators, array indices, etc. For example, the Visual C++ compiler has heavy iterator checking if the optimizations are turned off (and the "DEBUG" mode is chosen). An application that is built like this can run 5 to 10 times slower than an optimized version of the code.

Your compiler should have a list of options. Usually a -help or -? option lists the various command options.

Quote:

What are those "certain options" that I should try? Could you point out for me some resources about those?

I am not familiar with your compiler switches, so you will have to read the manual for the compiler, or see if a list of the options appear using the -help or -? switch. The g++ compiler is popular, but I don't know off the top of my head what those switches are (when I use g++, it is done under the CodeBlocks IDE, so I don't get too familiar with some of the command options).

Regards,

Paul McKenzie
December 11th, 2009, 11:19 PM
laserlight

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Paul McKenzie

3) Are you timing a debug build, or an optimized release build? If it's a debug build, those timing results are meaningless.

Indeed, I note that no optimisations are turned on:

Code:

$ g++ xytest.c -o xytest

dukevn, the relevant compiler documentation is available online: GCC 4.1.2 Online Manual. You should try:

Code:

$ g++ -O2 xytest.c -o xytest

Actually, you should normally enable a higher warning level, e.g.,

Code:

$ g++ -Wall -O2 xytest.c -o xytest

I also would use -pedantic to help check that I am not using some compiler extension, e.g., variable length arrays.
December 11th, 2009, 11:27 PM
Paul McKenzie

Re: best way to deal with big array or vector?

dukevn, here is a direct link to the optimization page. It explains in somewhat more detail than I did as to what optimization is and why you should use it.

http://gcc.gnu.org/onlinedocs/gcc-4....timize-Options

Regards,

Paul McKenzie
December 12th, 2009, 12:00 AM
dukevn

Re: best way to deal with big array or vector?

Reports:

- Compile: g++ -O2 -Wall xytest.c -o xytestO2. Run time: 04:44 (unix), 195s (Mac Pro)
- Compile: g++ -O3 -Wall xytest.c -o xytestO3. Run time: 03:52 (unix), 195s (Mac Pro)

It does help! Thank you very much Paul and laserlight. I am so happy that not only I have learned something, but also I got a very good resulting code.
December 12th, 2009, 12:21 AM
Speedo

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

std::map is new to me. Even the C++ book I have (C++ Primer Plus) does not have it. Would you mind giving me some examples and how to use it?

http://www.cplusplus.com/reference/

/\ Your best friend when you need a quick reference for the standard library.
December 12th, 2009, 03:32 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Speedo

http://www.cplusplus.com/reference/

/\ Your best friend when you need a quick reference for the standard library.

Thanks for the link Speedo. Actually those references do not help me as much as the examples. Normally I want to see examples first to see the interested method in action, then learn the theory/reference later.
December 12th, 2009, 03:50 PM
VladimirF

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

...I got a very good resulting code.

So are you OK with these results? (It's still taking over 3 minutes!)
Is that a kind of processing you do often? I assume that if it was a one time deal, you'd be OK with original 11 minutes.
Text file are usually very zippable, could you try to zip your 1GB file and see how big it is? Any way you could share it?
I also assume that your lates code is in post # 31 above.
December 12th, 2009, 08:02 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by VladimirF

So are you OK with these results? (It's still taking over 3 minutes!)
Is that a kind of processing you do often? I assume that if it was a one time deal, you'd be OK with original 11 minutes.
Text file are usually very zippable, could you try to zip your 1GB file and see how big it is? Any way you could share it?
I also assume that your lates code is in post # 31 above.

No, I do not have to do that processing often. I was actually pleased because I have learned a lot.

The text is around 200MB when in tar.bz2. How could I share it?

And yes, my latest code is in post 31.

Anyway, I need to move on. It seems to me that map is two component container. How could I get a three component container similar like that, with the first component is the first column of input file, second is second, and third is third? The reason I ask because now I want to write output like this

Code:

a1 12 1 a1 13 1 a1 14 1 a1 15 1 a1 16 1 a1 17 1 a1 20 1 a1 21 2 a1 22 2 a1 23 2 a1 24 2 a1 25 2
December 12th, 2009, 11:20 PM
Paul McKenzie

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

How could I get a three component container similar like that, with the first component is the first column of input file, second is second, and third is third?

When you have data that is related to each other in some way, you shouldn't think of creating separate arrays for each piece of data.

For example, a student record in a school may contain a name, address, courses taken, and a whole lot of other things related to the student. What if there are 20 separate items related to one student? Would you create 20 arrays? You'll go crazy trying to maintain 20 separate arrays -- for example, if you want to sort on the name, you have to sort all the other arrays in parallel to keep the data together.

What you should do is create a struct that contains the individual items. Then create an array of those entries.

Code:

struct Whatever { std::string c1; std::string c2; std::string c3; };

Assuming that c1, c2, and c3 are your items, you create a vector, deque, linked list, etc. of these. All the data for a single line is together, and you just create an array of them.

Regards,

Paul McKenzie
December 13th, 2009, 12:30 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Paul McKenzie

What you should do is create a struct that contains the individual items. Then create an array of those entries.

Thansk Paul for your suggestion. Here is the code I came up with

Code:

#include <map> #include <vector> #include <stdio.h> #include <stdlib.h> #include <fstream> #include <iostream> #include <sstream> #include <string> #include <time.h> using namespace std; struct KeyStruct { string Name; int x; KeyStruct(const string name,const int xval) : Name(name),x(xval) {} }; struct KeyCmp { bool operator()( const KeyStruct & k1, const KeyStruct & k2 ) { if ( k1.Name < k2.Name ) { return true; } else if ( k1.Name == k2.Name && k1.x < k2.x ) { return true; } else { return false; } } }; typedef map<KeyStruct, int, KeyCmp> mapStruct; int main( int argc, char *argv[] ) { if ( argc == 1 ) { fprintf(stderr, "Usage: xyTest <input> <output>\n"); return 1; } mapStruct mapTest; ifstream fin(argv[1]); string tempName; int x1, x2; while ( fin >> tempName >> x1 >> x2 ) { string temp; getline(fin,temp); for ( int i = x1; i <= x2; ++i ) { KeyStruct mapTemp(tempName,i); ++mapTest[mapTemp]; } } fin.close(); ofstream fout(argv[2]); for ( mapStruct::iterator i = mapTest.begin(), end = mapTest.end(); i != end; ++i ) { fout << (i->first).Name << '\t' << (i->first).x << '\t' << i->second << '\n'; } fout.close(); return 0; }

It works fine, but any other comment/suggestion to shorten/improve the code is welcome.

Now I am working on how to write the summary, not all the points like in the beginning. Starting from 0, the summary should report different intervals with different counts. For example, with input file like this

Code:

a1 12 17 a1 20 25 a1 21 26 a2 13 18 a2 35 40 a1 16 21 a3 51 56 a3 65 70

the output should look like

Code:

a1 0 11 0 a1 12 15 1 a1 16 17 2 a1 18 19 1 a1 20 20 2 a1 21 21 3 a1 22 25 2 a1 26 26 1 a2 0 12 0 a2 13 18 1 a2 19 34 0 a2 35 40 1 ...

Any ideas?

Thanks,

D.
December 13th, 2009, 12:39 PM
Lindley

Re: best way to deal with big array or vector?

Quote:

Originally Posted by dukevn

Are you saying that splitting input file to 8 chunks, processing those 8 chunks in parallel will not help at all?

I can't be sure, but I wouldn't expect it to help very much. Why don't you try and find out?
December 13th, 2009, 12:46 PM
dukevn

Re: best way to deal with big array or vector?

Quote:

Originally Posted by Lindley

I can't be sure, but I wouldn't expect it to help very much. Why don't you try and find out?

I will, since it is in my learning plan, but not just now because I have a feeling that it will involve more reading and time.

Show 50 post(s) from this thread on one page

All times are GMT -5. The time now is 04:06 PM.