-
how does getline() know what line it's getting???
Maybe a silly question, but I often use a combination of getline() and fstream in a while loop to read a delimited text file.
Code:
ifstream read_file;
stringstream file_data_stream;
string new_line, new_cell;
// open the data file into file stream
read_file.open( input_file.c_str() );
// read in each line of the index file
while(getline(read_file, new_line)) {
// add current row to stringstream
file_data_stream << new_line;
// parse stringstream on tab to get fields
while(getline(file_data_stream, new_cell,'\t')) {
// the data is now parsed
}
}
This will read through the input_file and parse it into "cells" by line and tab. There doesn't seem to be an iterator to allow getline() to keep track of where it is in the file.
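On the question of how getline() keeps its place: as far as I know there is no separate iterator. The stream object itself remembers the current read position, and each getline() call continues from where the previous read stopped. The sketch below is only an illustration under that assumption. The helper name parse_tab_delimited is made up, and it builds a fresh istringstream for each line instead of reusing one stringstream, because a reused stream stays in its end-of-file state after the inner loop unless clear() is called.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split every line of `in` on tabs.
// getline() needs no line index; `in` remembers where the last read stopped.
std::vector<std::vector<std::string>> parse_tab_delimited(std::istream& in) {
    std::vector<std::vector<std::string>> rows;
    std::string line;
    while (std::getline(in, line)) {
        // A fresh stream per line avoids carrying over the eof/fail state
        // left behind by the inner loop on the previous line.
        std::istringstream fields(line);
        std::vector<std::string> cells;
        std::string cell;
        while (std::getline(fields, cell, '\t')) {
            cells.push_back(cell);
        }
        rows.push_back(cells);
    }
    return rows;
}
```

Any std::istream can be passed in, so the same function should work for an opened ifstream or for an istringstream used in a quick test.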
I need to read in two files so that I am reading the same line of each file one at a time. In other words, read in the first line of file 1 and then the first line of file 2. I can't see how to do this with the structure above.
If I did something like,
Code:
ifstream read_file1, read_file2;
stringstream file_data_stream1, file_data_stream2;
string new_line1, new_cell1, new_line2, new_cell2;
// open both files into file streams
read_file1.open( input_file1.c_str() );
read_file2.open( input_file2.c_str() );
// read in each line of the first input file
while(getline(read_file1, new_line1)) {
// also read the second input file
getline(read_file2, new_line2);
// add current row to stringstream
file_data_stream1 << new_line1;
file_data_stream2 << new_line2;
// parse stringstream for first file on tab to get fields
while(getline(file_data_stream1, new_cell1,'\t')) {
// also parse stringstream for second file
getline(file_data_stream2, new_cell2,'\t');
// the data for the same line of both files is now parsed
}
}
Would that read both files in registration? Does getline() have an iterator somewhere I could access to instruct it to read a specific line? Am I going about this in the wrong way altogether?
I could always just read in one file and store it, but that doesn't seem very efficient in this case.
LMHmedchem
-
Re: how does getline() know what line it's getting???
Perhaps you want:
Code:
while(getline(read_file1, new_line1) && getline(read_file2, new_line2))
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
laserlight
Perhaps you want:
Code:
while(getline(read_file1, new_line1) && getline(read_file2, new_line2))
What would the behavior be here if the two files don't have the same number of lines?
I was able to get this working with something like,
Code:
// read in each line of the first input file
while(getline(read_file1, new_line1)) {
// also read the second input file
getline(read_file2, new_line2);
// add current rows to stringstreams
file_data_stream1 << new_line1;
file_data_stream2 << new_line2;
// parse stringstream for first file on tab to get fields
while(getline(file_data_stream1, new_cell1,'\t')) {
file1_data.push_back(new_cell1);
}
// parse stringstream for second file on tab to get fields
while(getline(file_data_stream2, new_cell2,'\t')) {
file2_data.push_back(new_cell2);
}
}
I had to tab parse the new_lines in separate while loops since the number of columns may not be the same in both files.
I need to create an exception for there being a different number of rows in the two files. My code above will stop when there are no more lines in the first file. If there are fewer lines in the second file than the first, I have a way to determine that. At the moment, I have no way of knowing if there are more lines in the second file.
I could always open both files and count the lines before I start processing, but that seems inefficient. I can post the whole program if anyone is interested.
LMHmedchem
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by LMHmedchem
What would the behavior be here if the two files don't have the same number of lines?
It is an &&, which means that both conditions must be satisfied for the loop to keep running. Therefore, it will only loop for as many lines as there are in the file with fewer lines.
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
laserlight
It is an &&, which means that both conditions must be satisfied for the loop to keep running. Therefore, it will only loop for as many lines as there are in the file with fewer lines.
I added this code to count the lines in both files and then compare them.
Code:
// count the number of rows in the index file
index_line_size = count(istreambuf_iterator<char>(read_file1), istreambuf_iterator<char>(), '\n');
// count the number of rows in the merge file
merge_line_size = count(istreambuf_iterator<char>(read_file2), istreambuf_iterator<char>(), '\n');
// make sure that both files have the same number of lines
if(index_line_size != merge_line_size) {
cerr << "the index file " << index_data_file << " has " << index_line_size << " lines and" <<endl;
cerr << "the merge file " << merge_data_file << " has " << merge_line_size << " lines" <<endl;
cerr << "both files must have the same number of rows" <<endl;
exit(-3);
}
The major annoyance with doing this is that I seem to have to close the files, clear the stringstream, and then open the files again to read and parse them. I guess that's not a big deal, but I can't see any way to count the lines of input and then go back to start reading the first line. I could read the files in separate loops and just store them. Then I could check the size of the containers and process if they match. I don't know if that would be better or not.
On a pair of files with ~50,000 lines and 30 columns, this runs in ~14s. There is an extra 1s from counting both files.
LMHmedchem
-
Re: how does getline() know what line it's getting???
Well, what do you want to do if the number of lines in the files don't match?
With my suggestion, you can use the eof() member function to check if EOF has been reached after the loop ends, then continue looping over the file that still has lines left to read.
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
LMHmedchem
I added this code to count the lines in both files and then compare them.
There is no need to parse the files twice if you just want to check that they have the same number of lines. laserlight's suggestion can be easily altered to do that.
Code:
bool res1, res2;
while((res1 = getline(read_file1, new_line1)) &&
(res2 = getline(read_file2, new_line2)))
{
// ...
}
if (res1 || res2)
{
// number of lines does not match
}
It's easy to add some code to count the number of lines if you need that.
The only reason I can think of to check if the files have the same number of lines first is if you want to provide an error message as soon as possible in case of error.
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
laserlight
Well, what do you want to do if the number of lines in the files don't match?
Quote:
Originally Posted by
D_Drmmr
The only reason I can think of to check if the files have the same number of lines first is if you want to provide an error message as soon as possible in case of error.
This is correct, there is an error if the number of lines don't match and the program exits.
As typically seems to happen, I started by asking one question and moved to another without adding the relevant information. I use this boilerplate code for a lot of text file utilities. This particular one merges two delimited text files on a common index. The files should have the same number of rows with the index in the same order. I can sort with another tool before the merge if that is necessary. It is very important to verify that the data from the two files remains in registration and that the tool was passed the correct pair of files. This is mainly checking the line count and then matching key values when the output is written.
The benefit I can see to pre-counting the lines in the files is that you know right away if there is a mismatch. If I check EOF or when getline() returns false, I won't know there is an issue until one of the files is fully read (or until there is a key mismatch, since I check that line by line).
Is there a way to count the lines without having to open and close the files, and then open them again to process and parse? I am not parsing the files twice, but I do seem to have to open them twice.
LMHmedchem
-
Re: how does getline() know what line it's getting???
If you really want to perform the check before processing, then I think you mainly have to choose between reading the file twice and reading once but saving the lines read. Actually, if the files are expected to normally contain correct input, why not just process and detect the error, and upon error detection, ditch what has been processed?
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
laserlight
If you really want to perform the check before processing, then I think you mainly have to choose between reading the file twice and reading once but saving the lines read. Actually, if the files are expected to normally contain correct input, why not just process and detect the error, and upon error detection, ditch what has been processed?
This is probably the best approach since most of the time there will not be a problem. The most likely reasons for there being an issue would be if I entered the wrong files in the arguments, or because one of the files was not sorted in the same way as the other. Both of those cases would likely fail quickly because of mismatched key values, and the second problem would not be revealed by line counting. I don't see any likely situation where the error wouldn't be detected until near the end. If it happens, oh well.
I have test processed some files where the first file is 1.5M and the second is 6.1M and both files have 42,586 rows. It takes ~14s to process these files. That seems a bit on the slow side for a compiled app. Do you see anything here that will be especially slow? I can post the entire code and some test files if anyone wants to have a look. It's about 300 lines.
LMHmedchem
-
Re: how does getline() know what line it's getting???
You could profile your code to find out where exactly is the bottleneck.
-
Re: how does getline() know what line it's getting???
Is that just the -p flag with g++?
g++ -p -o myApp myApp.cpp
I don't use gdb because I have had trouble getting it to work with my fortran code.
LMHmedchem
-
Re: how does getline() know what line it's getting???
I think it's -pg, actually.
-
1 Attachment(s)
Re: how does getline() know what line it's getting???
I am having problems getting the logic to evaluate the way I expect.
I have the while loop as,
Code:
bool have_line1; bool have_line2;
// get each line from both files in sequence, record in a bool whether a line was received
while( (have_line1 = getline(read_file1, new_line1)) &&
(have_line2 = getline(read_file2, new_line2)) ) {
//...
}
It seems as if the bool values will always be true as long as the while evaluates as true, so I put the check code after the while loop.
Code:
// if one files runs out of lines before the other, this should be triggered
if(have_line1 != have_line2){
if(have_line1 == false) {
cerr << "the index file had fewer lines than the merge file" <<endl;
cerr << "processing did not complete normally, check output" <<endl;
exit(-3);
}
else if(have_line2 == false) {
cerr << "the merge file had fewer lines than the index file" <<endl;
cerr << "processing did not complete normally, check output" <<endl;
exit(-3);
}
}
This error is always triggered, even when processing completes normally and the printout shows that have_line1= 0 and have_line2= 1. Both files have the same number of lines, so both bool values should be 1 until the file ends, and then both 0.
I'm not sure about the logic posted by D_Drmmr
Code:
if (res1 || res2)
{
// number of lines does not match
}
That reads to me, if res1 or res2, and both of these should be true until the file finished. I almost never use Boolean logic, so I may be misunderstanding. It seems it should be,
Code:
if (!res1 || !res2)
{
// number of lines does not match
}
which would be if either is false (I think), but both should be false when the while has finished working through the file. It seems as if you are looking for the condition where one is false and one is true, meaning that control dropped out of the while when it was still getting lines from one of the files but not the other.
Am I missing the point here? I have attached my src and test files if anyone is interested.
LMHmedchem
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
LMHmedchem
Am I missing the point here? I have attached my src and test files if anyone is interested.
When you start comparing boolean values with false, it's either time to get some sleep or you've missed the point. ;)
The loop runs as long as both bools are true. That means that after the loop, at least one of the bools is false. If both are false, the two files have the same number of lines. So only if exactly one of the two is true, you have an error. Now you pick which conditional expression matches that situation.
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
D_Drmmr
There is no need to parse the files twice if you just want to check that they have the same number of lines. laserlight's suggestion can be easily altered to do that.
Code:
bool res1, res2;
while((res1 = getline(read_file1, new_line1)) &&
(res2 = getline(read_file2, new_line2)))
{
// ...
}
if (res1 || res2)
{
// number of lines does not match
}
This won't work due to the short-circuit evaluation of the && operator. Suppose we have files with equal numbers of lines. While we are still reading lines from both files, res1 and res2 will both be true. But when we finally run out, the first getline call will return false (setting res1 to false) and the while condition will thus be false. Therefore the second part of the while condition will not be evaluated and res2 will remain true.
The way to do it is make sure both getline calls are called each time round the loop using a "loop and a half":
Code:
bool res1, res2;
do {
res1 = getline(read_file1, new_line1);
res2 = getline(read_file2, new_line2);
if ( !res1 || !res2 ) {
// One of the files has run out. Do whatever needs to be done then break out of the loop
// Code ...
break;
}
// Both files were read, so do the processing...
} while (res1 && res2);
if (res1 || res2)
{
// number of lines does not match, as the while loop quit
// while one of res1, res2 was true, and the other was false
}
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
D_Drmmr
When you start comparing boolean values with false, it's either time to get some sleep or you've missed the point. ;)
Both of these conditions seem to be true on a regular basis.
Quote:
Originally Posted by
D_Drmmr
The loop runs as long as both bools are true. That means that after the loop, at least one of the bools is false. If both are false, the two files have the same number of lines.
The problem is that I am getting values that are not the same, even when the files have the same number of lines. I think the next post explains this.
Quote:
Originally Posted by
Peter_B
This won't work due to the short-circuit evaluation of the && operator. Suppose we have files with equal numbers of lines. While we are still reading lines from both files, res1 and res2 will both be true. But when we finally run out, the first getline call will return false (setting res1 to false) and the while condition will thus be false. Therefore the second part of the while condition will not be evaluated and res2 will remain true.
Thanks, I guess that is why I was getting bool values of 0 and 1 even when the files have the same number of rows and the data processes.
Using the do while you posted,
Code:
bool have_line1; bool have_line2;
do{
have_line1 = getline(read_file1, new_line1);
have_line2 = getline(read_file2, new_line2);
cout << "have_line1= " << have_line1 << endl;
cout << "have_line2= " << have_line2 << endl;
// check to see if both getline calls got a line, exit if not
if(!have_line1 || !have_line2) {
// error, neither of these should be 0 in the loop unless one file is shorter
exit (-1);
}
// process input
} while (have_line1 && have_line2);
I am still getting my error here. The printout of the two bools is,
have_line1= 1
have_line2= 1
have_line1= 1
have_line2= 1
have_line1= 1
have_line2= 1
have_line1= 1
have_line2= 1
have_line1= 1
have_line2= 1
have_line1= 1
have_line2= 1
have_line1= 1
have_line2= 1
have_line1= 1
have_line2= 1
have_line1= 1
have_line2= 1
have_line1= 1
have_line2= 1
have_line1= 0
have_line2= 0
This is what you expect in that the values should both be 1 until the files run out, and then they should both be 0. The issue is that this is printed from inside the loop and I don't see how I could still be in the loop when both values are 0. Does this do while structure always run through the code one extra loop? Is it right that the evaluation at the end, while (have_line1 && have_line2);, means that you will always run through one last time with both bools = 0?
If I switch to,
Code:
// check to see if both getline calls got a line, exit if not
if(have_line1 != have_line2) {
// error, neither of these should be 0 in the loop unless one file is shorter
}
Then it behaves more like I expect. Did I do something wrong here?
I added some code so that it won't try to write output if both bools = 0, but I'm not sure I'm on the right track.
It looks like I don't need to check the value of the bools after the loop, since I think a mismatch in bool values will trigger the exception before the loop ends. It never hurts to have an extra trap or so, even if you think the condition can never happen.
LMHmedchem
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
Peter_B
This won't work due to the short-circuit evaluation of the && operator.
You're right. Thanks for spotting my error.
I'd rate, but I have to spread some reputation first.
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
LMHmedchem
Using the do while you posted,
I am still getting my error here. The printout of the two bools is,
...[DELETED]...
This is what you expect in that the values should both be 1 until the files run out, and then they should both be 0. The issue is that this is printed from inside the loop and I don't see how I could still be in the loop when both values are 0. Does this do while structure always run through the code one extra loop? Is it right that the evaluation at the end, while (have_line1 && have_line2);, means that you will always run through one last time with both bools = 0?
The while condition is not continually evaluated at every point through the loop. It is only evaluated when execution reaches the while statement at the end of each time around the loop. So in your code these lines:
Code:
cout << "have_line1= " << have_line1 << endl;
cout << "have_line2= " << have_line2 << endl;
will still run even when have_line1 or have_line2 have just been set to false. You should consider these cout lines to be part of the 'process input' region of the do-while loop.
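A tiny illustration of that point, with no files involved. The function name is made up, and the countdown simply stands in for getline() succeeding a fixed number of times.

```cpp
#include <vector>

// Hypothetical demo: record what the loop body sees on each pass.
std::vector<bool> flags_seen_in_body(int successful_reads) {
    std::vector<bool> seen;
    int remaining = successful_reads;
    bool have_line;
    do {
        have_line = (remaining-- > 0);  // stands in for a getline() call
        seen.push_back(have_line);      // runs even when have_line is false
    } while (have_line);                // only checked here, after the body
    return seen;
}
```

With two successful reads the body runs three times, and the last pass sees false, which matches the final 0/0 pair in the printout.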
Also, this bit is not what I said:
Code:
if(!have_line1 || !have_line2) {
// error, neither of these should be 0 in the loop unless one file is shorter
exit (-1);
}
It is not an error for one (or both) of have_line1 or have_line2 to be false - it just means that one (or both) of the files have been fully read in. If both are false then both files have been exhausted at the same time, so they are the same length. You should be using 'break' to quit the loop here (as in my example), not 'exit' to stop the entire program.
Quote:
Originally Posted by
LMHmedchem
If I switch to,
Code:
// check to see if both getline calls got a line, exit if not
if(have_line1 != have_line2) {
// error, neither of these should be 0 in the loop unless one file is shorter
}
Then it behaves more like I expect. Did I do something wrong here?
If the files are the same length this condition (have_line1 != have_line2) will never be true. To give the loop a chance to finish, this condition should be checked after the loop has finished, not inside the loop. Given that have_line1 and have_line2 are booleans, there are four possible combinations of values when the loop has finished. They are:
- both are true - actually not possible as the loop would still be running
- both are false - so the files were the same length
- have_line1 is true, have_line2 is false - file1 still had a line but file2 didn't, so they are unequal with file1 being longer
- have_line1 is false, have_line2 is true - same as previous but file2 is longer than file1
These possibilities are covered by this check after the loop (originally posted by D_Drmmr but with changes to variable names)
Code:
if (have_line1 || have_line2)
{
// number of lines does not match
}
Quote:
Originally Posted by
LMHmedchem
It never hurts to have an extra trap or so, even if you think the condition can never happen.
You shouldn't be adding code to check conditions unless you know how those conditions could exist. Doing so indicates that you haven't studied the possible paths that execution could take through your code. And if execution could never pass through that code there is no way to test it.
Quote:
Originally Posted by
D_Drmmr
You're right. Thanks for spotting my error.
I'd rate, but I have to spread some reputation first.
That's fine - it's the thought that counts :D
-
Re: how does getline() know what line it's getting???
By the way, there is another neat way to get around the short-circuiting. That is to use the (seldom used) comma operator. This allows you to put several expressions where only one is usually allowed. So in this case:
Code:
bool res1, res2;
while (
(res1 = getline(read_file1, new_line1)), // This line executes first
(res2 = getline(read_file2, new_line2)), // Then this
(res1 && res2) // Finally, this is evaluated as the while condition
)
{
// ...
}
if (res1 || res2)
{
// number of lines does not match
}
It is not as readable though.
-
Re: how does getline() know what line it's getting???
Wouldn't it be simpler to just use operator & instead?
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
laserlight
Wouldn't it be simpler to just use operator & instead?
Actually, that would work just fine here. I honestly didn't even consider that as it is a bitwise operator rather than a logical operator. So - nice idea :)
I'll just add a couple of caveats on the use of & to avoid short-circuiting though, as it differs in a couple of important ways from the && operator - though these differences do not matter in the current case.
(@laserlight - you obviously know all this, it is intended for people who don't :))
They are:
- && evaluates to true when the operands are both any non-zero value. So (1 && 2) evaluates to true. However & does a bit-by-bit comparison so only evaluates to true when the operands are both 1 in some bit position. This means (1 & 2) evaluates to false (or, strictly speaking, to 0).
To make & work in the same way as the logical operator we need to cast the operands to bool first. This will convert any non-zero value to true, so ( (bool)1 & (bool)2 ) evaluates to true (or, again strictly speaking, 1). This cast is even needed when the operands are of type BOOL (used as return values in a lot of Windows API functions). And if you forget the cast the compiler will not help you - it will compile just fine, but not work correctly.
- & does not define the order of evaluation of its operands. The compiler is free to decide which to evaluate first in order to best optimize the code. When the evaluation does not have side-effects, or when the side-effects are independent (as in the current case) this does not matter.
The approaches I describe ("loop and a half" and comma operator) both have a well-defined evaluation order, and work for any types where non-zero means true without having to remember to cast to bool. So I think they have a wider applicability.
-
Re: how does getline() know what line it's getting???
Quote:
Originally Posted by
Peter_B
& does not define the order of evaluation of its operands. The compiler is free to decide which to evaluate first in order to best optimize the code. When the evaluation does not have side-effects, or when the side-effects are independent (as in the current case) this does not matter.
actually, it's even worse because & does not define a sequence point ( or in c++11 lingo, the evaluations of its arguments are unsequenced, not just indeterminately sequenced ) making a potential read and write access to the same scalar object undefined behavior, which is worse than just an unspecified ordering of evaluations ( eg. the expression (( cout << "1" ) & ( cout << "2" )) can print "12" or "21", but ( c++ & c++ ) can give anything ... );
-
Re: how does getline() know what line it's getting???
Good point, superbonzo. Some compilers can help point this use out though. g++ has a 'sequence-point' warning (-Wsequence-point on the command line) which doesn't always identify problems but does in this case.
For an expression like that, this gives the warning:
Code:
warning: operation on 'c' may be undefined