I am trying c++ with some maths now, and I started with simple 2D vector (or array, or matrix, what ever you call it). What I want to do is: in the file below
each line is one "line", or "vector", with the second and third columns are x coordinates. The program read line by line, reads those two x-coordinates (say x1 and x2), and gives y at any point between those two xs additional one unit (so y(20) = y(21) = ... = y(25) = 1 for the first line). y of all points start at zero. You can see that y(21) = 2 since 21 appears in two intervals (two first lines), y(37) = 3 since 37 appears in three lines.
After reading the file, it should write to a new text file with the non-zero points only, point by line, like below:
Quite simple if the input file is small. The algorithm I came up with is:
- define array y from 0 to max of x, initial values are all zero.
- read the file, if x is from x1 to x2, then y(x)++;
- after done, write to file: from 0 -> x max, if y(x) is not zero, then write it down.
Problem is if x max is big, then the defined array would be big, and that is a waste of memory as well since a lot of y array is zero. Also, if x max is very big (millions to billions) then the algorithm would not work.
When your input is very sparse or over a potentially large range, consider using a std::map instead of an array or std::vector. In this case, defining y as a std::map<int,int> rather than an array should provide the desired behavior.
When your input is very sparse or over a potentially large range, consider using a std::map instead of an array or std::vector. In this case, defining y as a std::map<int,int> rather than an array should provide the desired behavior.
std::map is new to me. Even the C++ book I have (C++ Primer Plus) does not have it. Would you mind giving me some examples and how to use it?
I am gonna test it with a very big input file to see how it performances, but any recommendation / suggestion to improve the code to work with big file is welcome. Also, I need to see some performance's reports (CPU, usage, memory etc...), so any comments are greatly appreciated.
EDIT1: Nope, the code is not finished yet. I need to determine the max value of x to put in the for loop of k.
EDIT2: Code updated.
Last edited by dukevn; December 10th, 2009 at 09:41 PM.
I am gonna test it with a very big input file to see how it performances, but any recommendation / suggestion to improve the code to work with big file is welcome. Also, I need to see some performance's reports (CPU, usage, memory etc...), so any comments are greatly appreciated.
I find it easier to write that input processing loop in this way:
Code:
while (getline(fin, tempLine, '\n')) {
stringstream ss(tempLine);
string temp;
int x1, x2;
if (ss >> temp >> x1 >> x2) {
// Process the data.
// ...
} else {
// Handle input format error.
}
}
I do not understand why you use strings as keys for the map when it seems like you want to use integers as keys.
C + C++ Compiler: MinGW port of GCC
Build + Version Control System: SCons + Bazaar
I do not understand why you use strings as keys for the map when it seems like you want to use integers as keys.
I defined map<int, int> but I got error when compiling, and did not understand why, so I googled a little bit, saw people use string, char etc. as well, so I tried with string. Using int seems to shorten the code, but I need to look back why I had error.
Originally Posted by laserlight
I find it easier to write that input processing loop in this way:
Code:
while (getline(fin, tempLine, '\n')) {
stringstream ss(tempLine);
string temp;
int x1, x2;
if (ss >> temp >> x1 >> x2) {
// Process the data.
// ...
} else {
// Handle input format error.
}
}
I will try your suggestion and report back. Thanks.
New version with laserlight and Lindley suggestions:
Code:
#include <map>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <time.h>
using namespace std;
int main( int argc, char *argv[] ) {
time_t start,end;
double diff;
time (&start);
map<int, int> mapTest;
ifstream fin(argv[1]);
string tempLine;
int XMax;
while ( getline(fin, tempLine, '\n') ) {
stringstream ss(tempLine);
string temp;
int x1, x2;
if ( ss >> temp >> x1 >> x2 ) {
for (int i=0;i<=(x2-x1);i++) {
mapTest[x1+i]++;
}
XMax = x2;
}
}
fin.close();
ofstream fout(argv[2]);
for ( int k=0;k<=XMax;k++ ) {
if ( mapTest[k]!=0 ) {
fout << k << "\t" << mapTest[k] << endl;
}
}
fout.close();
time (&end);
diff = difftime(end, start);
printf("Running time: %.2lf\n",diff);
return 0;
}
For an input file of 84 MB, the new version is around 454s which is twice as fast as old version (924s) on a Macbook Pro Core 2 Duo 2.2 GHz and 4 GB of memory. Good news is that there is no memory error, but I still hope that I can speed it up (if possible).
Any idea how I can measure memory needed to run the code?
It will not make much of a difference in efficiency, but for readability this:
Code:
for (int i=0;i<=(x2-x1);i++) {
mapTest[x1+i]++;
}
should be:
Code:
for (int i = x1; i <= x2; ++i) {
++mapTest[i];
}
This is inefficient:
Code:
if ( mapTest[k]!=0 ) {
fout << k << "\t" << mapTest[k] << endl;
}
I would expect:
Code:
map<int, int>::iterator found = mapTest.find(k);
if (found != mapTest.end()) {
fout << k << "\t" << found->second << endl;
}
However, why bother with searching? You might as well just iterate over the map:
Code:
for (map<int, int>::iterator i = mapTest.begin(), end = mapTest.end(); i != end; ++i) {
fout << i->first << '\t' << i->second << '\n';
}
fout.flush();
This also renders your XMax variable redundant, so it can be removed along with the code to set it (which was not correct anyway, in my opinion, since it does not actually record the max).
C + C++ Compiler: MinGW port of GCC
Build + Version Control System: SCons + Bazaar
It will not make much of a difference in efficiency, but for readability this:
Code:
for (int i=0;i<=(x2-x1);i++) {
mapTest[x1+i]++;
}
should be:
Code:
for (int i = x1; i <= x2; ++i) {
++mapTest[i];
}
Honestly I dont understand ++x vs x++. What is the difference here?
Originally Posted by laserlight
This is inefficient:
Code:
if ( mapTest[k]!=0 ) {
fout << k << "\t" << mapTest[k] << endl;
}
I would expect:
Code:
map<int, int>::iterator found = mapTest.find(k);
if (found != mapTest.end()) {
fout << k << "\t" << found->second << endl;
}
However, why bother with searching? You might as well just iterate over the map:
Code:
for (map<int, int>::iterator i = mapTest.begin(), end = mapTest.end(); i != end; ++i) {
fout << i->first << '\t' << i->second << '\n';
}
fout.flush();
This also renders your XMax variable redundant, so it can be removed along with the code to set it (which was not correct anyway, in my opinion, since it does not actually record the max).
I will try your suggestions. But why XMax is not correct? In my code, each line when being read, the second coordinate will be assigned as xmax, so will be the last-not-empty line, and that value should be the maximum of x. What am I missing here?
Honestly I dont understand ++x vs x++. What is the difference here?
In this case, because it is applied to a built-in type, there should be no difference in efficiency when used in standalone expressions. However, for class types a typical implementation of pre-increment is no less efficient than post-increment, and could even be slightly more efficient.
Originally Posted by dukevn
But why XMax is not correct? In my code, each line when being read, the second coordinate will be assigned as xmax, so will be the last-not-empty line, and that value should be the maximum of x. What am I missing here?
Consider this input:
Code:
a1 65 70
a1 12 17
The max would then be recorded as 17, not 70.
C + C++ Compiler: MinGW port of GCC
Build + Version Control System: SCons + Bazaar
In this case, because it is applied to a built-in type, there should be no difference in efficiency when used in standalone expressions. However, for class types a typical implementation of pre-increment is no less efficient than post-increment, and could even be slightly more efficient.
I see, thanks.
Originally Posted by laserlight
Consider this input:
Code:
a1 65 70
a1 12 17
The max would then be recorded as 17, not 70.
Ah, yeah. In the original file, you would not see such inputs, since it is pre-sorted by the second column (and I forgot to mention that, my mistake ). Anyway, you pointed out is great, cause then now the code will work with unsorted file as well.
By the way, what is difference between fout.flush(); and fout.close();?
Bookmarks