Re: best way to deal with big array or vector?
Quote:
Originally Posted by
VladimirF
My guess is that code like that should execute at the speed of file I/O.
I know I can copy a 1GB file in about 30 seconds, so 11 minutes sounds like WAY too much.
Could you comment out everything in your code except for I/O and see how long that takes?
Do you mind posting your code and a sample data file?
Here is the final code
Code:
#include <map>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <time.h>
using namespace std;
int main( int argc, char *argv[] ) {
if ( argc == 1 ) {
fprintf(stderr, "Usage: xytest <input> <output>\n");
return 1;
}
time_t start,end;
double diff;
time (&start);
map<int, int> mapTest;
ifstream fin(argv[1]);
string tempLine;
int x1, x2;
while ( fin >> tempLine >> x1 >> x2 ) {
getline(fin,tempLine);
for ( int i = x1; i <= x2; ++i ) {
++mapTest[i];
}
}
fin.close();
ofstream fout(argv[2]);
for ( map<int, int>::iterator i = mapTest.begin(), end = mapTest.end(); i != end; ++i ) {
fout << i->first << '\t' << i->second << '\n';
}
fout.close();
time (&end);
diff = difftime(end, start);
printf("Running time: %.2lf\n",diff);
return 0;
}
Some lines in the input file is like this
Code:
a1 35952485 35952517 2 0 -
a1 66611781 66611813 15 0 +
a1 154104890 154104922 10 0 +
a1 224631511 224631543 6 0 -
and there are 25501054 lines similar like that in the file. The size of file is 1141987065.
Let me try your suggestion to see what results are.
Re: best way to deal with big array or vector?
Quote:
Originally Posted by
VladimirF
My guess is that code like that should execute at the speed of file I/O.
Looks like the ++mapTest[i] takes a lot of time. I tested with replacing ++mapTest[i] with, for example, ++k, and deleting the writing to file part, then the code needs only 45s to run. Get ++mapTest[i] back in, it needs 656s. The rest (~ a minute or more) is the time of writing to output.
Re: best way to deal with big array or vector?
Quote:
Originally Posted by
dukevn
Looks like the ++mapTest[i] takes a lot of time. I tested with replacing ++mapTest[i] with, for example, ++k, and deleting the writing to file part, then the code needs only 45s to run. Get ++mapTest[i] back in, it needs 656s. The rest (~ a minute or more) is the time of writing to output.
1) What compiler are you using? Version?
2) What compiler options?
3) Are you timing a debug build, or an optimized release build? If it's a debug build, those timing results are meaningless.
Regards,
Paul McKenzie
Re: best way to deal with big array or vector?
Quote:
Originally Posted by
Paul McKenzie
1) What compiler are you using? Version?
On Unix:
Code:
$ g++ --version
g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
On Mac Pro:
Code:
$ g++ --version
i686-apple-darwin9-g++-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)
Quote:
Originally Posted by
Paul McKenzie
2) What compiler options?
Code:
$ g++ xytest.c -o xytest
Quote:
Originally Posted by
Paul McKenzie
3) Are you timing a debug build, or an optimized release build? If it's a debug build, those timing results are meaningless.
Not sure what you are asking here, but once I compiled and get executable file xytest, I just run it. The time reported here is the time measured inside the code.
What is debug build and what is optimized release build? How can my code be considered as debug, or release?
Re: best way to deal with big array or vector?
Quote:
Originally Posted by
dukevn
Not sure what you are asking here,
Well, what you don't understand is the whole ballgame.
When you compile your application, there are certain options that you must choose so that the compiler optimizes the code. If you don't do that, the compiler generates unoptimized code. Unoptimized code is used for debugging purposes (the compiler may be checking iterators and doing extra work).
The compiler options you have used do not turn on the optimizations. You need to rebuild your application using the proper optimization switches, and then retest your code.
Regards,
Paul McKenzie
Re: best way to deal with big array or vector?
Quote:
Originally Posted by
dukevn
Looks like the ++mapTest[i] takes a lot of time.
Do you know much about input data? The range of values? Distribution? Tipical span of one line? Limit on the output counters? Will it fit in a byte?
I am trying to come up with some clever data structure. Like a map of short (4K?) arrays, where a key will be the low value shifted by a few bits to the right. That will provide a very fast access for your "for" loops, limiting a number of map look-ups to a number of lines in your file (or even less if you hit the same map key more than once).
The point is that you perform sequential access to your counters for each input line, and maps are not good for that.
Re: best way to deal with big array or vector?
Quote:
Originally Posted by
Paul McKenzie
Well, what you don't understand is the whole ballgame.
When you compile your application, there are certain options that you must choose so that the compiler optimizes the code. If you don't do that, the compiler generates unoptimized code. Unoptimized code is used for debugging purposes (the compiler may be checking iterators and doing extra work).
The compiler options you have used do not turn on the optimizations. You need to rebuild your application using the proper optimization switches, and then retest your code.
Regards,
Paul McKenzie
You got it right Paul. I've just started to learn C++ very recently. I do not know much about even C++ syntax, and I am trying to learn as much as I can. Honestly I do hear about coding optimization by choosing the right algorithms etc but never heard of optimization by compiling option. g++ input -o output is the only compiling command I know now (and option -lz if using zlib library).
What are those "certain options" that I should try? Could you point out for me some resources about those?
Re: best way to deal with big array or vector?
Quote:
Originally Posted by
VladimirF
Do you know much about input data? The range of values? Distribution? Tipical span of one line? Limit on the output counters? Will it fit in a byte?
Well, just a little about input data (via some examples that I am learning and coding), and none of the rest.
Quote:
Originally Posted by
VladimirF
I am trying to come up with some clever data structure. Like a map of short (4K?) arrays, where a key will be the low value shifted by a few bits to the right. That will provide a very fast access for your "for" loops, limiting a number of map look-ups to a number of lines in your file (or even less if you hit the same map key more than once).
The point is that you perform sequential access to your counters for each input line, and maps are not good for that.
I am happy to learn anything to feed up my tiny c++ background, so any hint/instruction/suggestion/recommendation are welcome.
Re: best way to deal with big array or vector?
Quote:
Originally Posted by
dukevn
Honestly I do hear about coding optimization by choosing the right algorithms etc but never heard of optimization by compiling option. g++ input -o output is the only compiling command I know now (and option -lz if using zlib library).
Basically, compilers, regardless of the language, are able to optimize the code that you've written (hence the term "optimizing compiler"). Things like moving invariant code out of loops so that they don't get executed over and over again is an example of an optimization that can be done by the compiler.
In your circumstance, not only will the compiler move code around to find the most efficient way to execute, but it may also eliminate "checking" code that checks for invalid iterators, array indices, etc. For example, the Visual C++ compiler has heavy iterator checking if the optimizations are turned off (and the "DEBUG" mode is chosen). An application that is built like this can run 5 to 10 times slower than an optimized version of the code.
Your compiler should have a list of options. Usually a -help or -? option lists the various command options.
Quote:
What are those "certain options" that I should try? Could you point out for me some resources about those?
I am not familiar with your compiler switches, so you will have to read the manual for the compiler, or see if a list of the options appear using the -help or -? switch. The g++ compiler is popular, but I don't know off the top of my head what those switches are (when I use g++, it is done under the CodeBlocks IDE, so I don't get too familiar with some of the command options).
Regards,
Paul McKenzie
Re: best way to deal with big array or vector?
Quote:
Originally Posted by Paul McKenzie
3) Are you timing a debug build, or an optimized release build? If it's a debug build, those timing results are meaningless.
Indeed, I note that no optimisations are turned on:
Code:
$ g++ xytest.c -o xytest
dukevn, the relevant compiler documentation is available online: GCC 4.1.2 Online Manual. You should try:
Code:
$ g++ -O2 xytest.c -o xytest
Actually, you should normally enable a higher warning level, e.g.,
Code:
$ g++ -Wall -O2 xytest.c -o xytest
I also would use -pedantic to help check that I am not using some compiler extension, e.g., variable length arrays.
Re: best way to deal with big array or vector?
dukevn, here is a direct link to the optimization page. It explains in somewhat more detail than I did as to what optimization is and why you should use it.
http://gcc.gnu.org/onlinedocs/gcc-4....timize-Options
Regards,
Paul McKenzie
Re: best way to deal with big array or vector?
Reports:
- Compile: g++ -O2 -Wall xytest.c -o xytestO2. Run time: 04:44 (unix), 195s (Mac Pro)
- Compile: g++ -O3 -Wall xytest.c -o xytestO3. Run time: 03:52 (unix), 195s (Mac Pro)
It does help! Thank you very much Paul and laserlight. I am so happy that not only I have learned something, but also I got a very good resulting code.
Re: best way to deal with big array or vector?
Quote:
Originally Posted by
dukevn
std::map is new to me. Even the C++ book I have (C++ Primer Plus) does not have it. Would you mind giving me some examples and how to use it?
http://www.cplusplus.com/reference/
/\ Your best friend when you need a quick reference for the standard library.
Re: best way to deal with big array or vector?
Quote:
Originally Posted by
Speedo
Thanks for the link Speedo. Actually those references do not help me as much as the examples. Normally I want to see examples first to see the interested method in action, then learn the theory/reference later.
Re: best way to deal with big array or vector?
Quote:
Originally Posted by
dukevn
...I got a very good resulting code.
So are you OK with these results? (It's still taking over 3 minutes!)
Is that a kind of processing you do often? I assume that if it was a one time deal, you'd be OK with original 11 minutes.
Text file are usually very zippable, could you try to zip your 1GB file and see how big it is? Any way you could share it?
I also assume that your lates code is in post # 31 above.