Read binary file with line delimiter
Hello to all,
First post in this busy forum. I hope someone can help me.
I want to read a binary file using "ff77" as the line separator, so that I can then parse each line one by one with a regex,
since the file is big. I have a small Ruby script, shown below, but I'm new to C++ and don't know how to replicate in C++
what this Ruby code does.
Code:
#!/usr/bin/env ruby
BEGIN{ $/="\xff\x77" } # Line separator = FF77
File.open(ARGV[0],"rb") # Open in binary mode
# Process each line one by one
while gets
  line = $_.unpack('H*')[0] # Storing the bytes for each line in "line" variable
  next unless line =~ /(..)(\d+)([A-B])/ # Regex with back-reference
  printf("%d %s %s\n",$1,$2,$3) # Printing backreferenced patterns
end
I've been looking for a way to set the line delimiter and found the getline function, but it seems getline only accepts a single character,
and I need 4 characters as the line separator.
My unsuccessful attempt is below; it seems this is not the way to do it.
Code:
#include <cstdlib>
#include <fstream>
int main() {
    std::ifstream input("C:\\binfile", ios::in | ios::binary);
    for( std::string line; getline( input, "ff77" ); )
    {
        printf("%s",line);
    }
    return 0;
}
Many thanks in advance for any help.
Re: Read binary file with line delimiter
Quote:
Originally Posted by
Philidor
Hello to all,
First post in this busy forum. I hope someone can help me.
I want to read a binary file using "ff77" as the line separator, so that I can then parse each line one by one with a regex
Opening a file in binary mode means that you're on your own and you get no help from C++ as to what is or are "end-of-line" character(s). That luxury goes to opening a file in text mode (and even that is limited).
In other words, there is no such thing as a "line separator" to the C++ stream when you open a file in binary mode. You have to parse the line yourself with the knowledge of what is a "line separator".
Regards,
Paul McKenzie
Re: Read binary file with line delimiter
Hello Paul,
Thanks for the answer. I used the term "line separator" as a way to describe separating the data into blocks, since each block begins with
77 and ends with FF. So, when FF77 is found, it means a new block begins.
The issue is that I don't know how to separate each block so I can parse it one at a time.
Thanks in advance for any help.
Re: Read binary file with line delimiter
Well, how would you conceptually read a block of memory and look for delimiters within that block of memory, while retaining the text between the delimiters?
Regards,
Paul McKenzie
Re: Read binary file with line delimiter
Hello Paul,
That is similar to what I'm asking for help with. I'm really a newbie in programming, and the Ruby code wasn't written by me.
Maybe an if statement could match ff 77 to find where a block begins. Maybe a more direct method exists in C++,
I don't know.
Maybe you or somebody else could help me to be able to store each block in a variable to have the option to
parse this string later.
Thanks in advance for the help.
Re: Read binary file with line delimiter
Quote:
Originally Posted by
Philidor
Hello Paul,
That is similar to what I'm asking for help with. I'm really a newbie in programming, and the Ruby code wasn't written by me.
Then you need someone already versed in C++ or programming in general to write this code. Or take the time to learn how to conceptualize a problem, write a plan on how to solve the problem using pencil and paper (no code), and then translate what you wrote to C++ code.
Quote:
Maybe an if statement could match ff 77 to find where a block begins. Maybe a more direct method exists in C++,
There isn't one. C++ is not Ruby, and I think this was your initial mistake. You equated what you can do with Ruby in one or two lines of code, and hoped that C++ could do the same thing with similar effort. That is not the case.
For C++, and really any programming language, you have to:
1) Read a block into memory.
2) Search the block of memory for your delimited string sequence.
3) While doing this, retain where the text began and where the delimiter was found -- between these two points is the text.
4) Save this text in some sort of container.
5) Skip over the found delimiter, set the pointer to the characters after the delimiter, and repeat steps 2 through 5.
...
Basically, it is a delimited file parser, where the delimiter is "ff77". This is not trivial if you don't know how to write a program. Throw into the mix that you have to read the file in chunks, so you have to check whether you have read only enough to get a "partial line", and know that your next read will give you the rest of that line.
Quote:
Maybe you or somebody else could help me to be able to store each block in a variable to have the option to
parse this string later.
You want a comma-delimited file parser program or function (but allow the "comma" to be some other set of characters that delimits the text). That is as close as you can come to a "canned solution" in C++ (even though it isn't really canned, it's just that someone wrote the function to do so).
Regards,
Paul McKenzie
Re: Read binary file with line delimiter
Hello Paul,
Thanks for the help.
I've been able to do steps 2 to 4 and partially 5, but I don't know how to set the correct condition for the "while loop" so that it stops when no further delimiter is found in the current block of memory being read.
What I've done is:
Code:
while (not end of current block of memory) { // This is the condition I don't know how could be
    x1 = curr_string.find("ff77",x2-1,4);
    x2 = curr_string.find("ff77",x1+1,4);
    string temp=curr_string.substr(x1, x2 - x1);
}
The condition I've tried is below, but I get infinite loop:
Code:
curr_string.find("ff77",x1+1,4)
Thanks again.
Re: Read binary file with line delimiter
Quote:
Originally Posted by
Philidor
Hello Paul,
Thanks for the help.
I've been able to do steps 2 to 4 and partially 5, but I don't know how to set the correct condition for the "while loop" so that it stops when no further delimiter is found in the current block of memory being read.
You know how big the block is. The string class has a size() member function.
Why not start with something simple? Assume the file is comma delimited (a simple 1 character delimiter), and you have to extract the text between the commas. Forget about the file for now; how about a simple hard-coded string:
Code:
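#include <string>
#include <vector>
std::vector<std::string> getCommaFields(const std::string& commaStr)
{
    // TODO: extract the text between the commas in commaStr
    return {}; // placeholder so that the stub returns a value
}
int main()
{
    std::vector<std::string> sVector;
    sVector = getCommaFields("Test1,Test2,This is test3");
}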
The code is supposed to take that string, and extract the text that is between the commas. Each text is stored in the vector of strings and is returned. So on return, sVector must be the following:
Code:
sVector[0] = "Test1"
sVector[1] = "Test2"
sVector[2] = "This is test3"
If you can't write that function, at least to 95% completeness, then you should start here. Once you have it done, look at the code, and change it to try multiple character delimiters.
Regards,
Paul McKenzie
Re: Read binary file with line delimiter
Hello Paul,
Thanks for the suggestion, I'll try to think about how to write a function that works for this.
One question: would this approach still be fine given that the real file I need to read is larger than 2 GB? I don't know whether I would need to read, for example, 1000 bytes at a time and apply the code you suggest, or open the complete file.
Thanks again for the help.
Re: Read binary file with line delimiter
Quote:
Originally Posted by
Philidor
One question: would this approach still be fine given that the real file I need to read is larger than 2 GB? I don't know whether I would need to read, for example, 1000 bytes at a time and apply the code you suggest, or open the complete file.
What you would do is read (much more than) 1000 bytes into a buffer. Then you parse the buffer for the character sequence that terminates each line.
The issue is that a read may straddle a line or the character sequence itself, which means that the next read of 1,000 bytes completes the string (or line terminator), and you have to take that into consideration.
Regards,
Paul McKenzie
Re: Read binary file with line delimiter
Hello Paul,
Thanks for your reply. I'm taking your suggestions and have been trying the code below. The positions where the commas occur are fine, but I get an error (run exit value 1) when assigning the substring to V[i] (in red).
I'm putting the condition "pos2<10000" because when a value is not found I receive the value 18446744073709551615.
Code:
#include <string>
#include <vector>
#include <iostream>
using namespace std;
vector<string> getCommaFields(const string& commaStr)
{
    int i = 0;
    size_t pos1 = 1;
    size_t pos2 = 1;
    vector<string> V;
    string str=commaStr;
    while (pos2<10000) {
        pos1 = commaStr.find(",",pos2-1,1);
        pos2 = commaStr.find(",",pos1+1,1);
        //if (pos2<10000){
        // V[i]=commaStr.substr(pos1, pos2 - pos1);
        //}
        cout<<pos1<<","<<pos2<<","<<str<<endl;
        i++;
    }
    // return(V);
}
int main()
{
    //const commaStr = "Test1,Test2,This is test3";
    vector<string> sVector;
    sVector = getCommaFields("Test1,Test2,Test3,Some text");
}
Thanks in advance for any help.
Re: Read binary file with line delimiter
That's because you haven't sized V, so initially V has no elements. Use push_back().
Code:
V.push_back(commaStr.substr(pos1, pos2 - pos1));
Quote:
I'm putting the condition "pos2<10000" because when a value is not found I receive the value 18446744073709551615.
When find() finds no match, it returns string::npos, which is the largest value a size_t can hold (hence the 18446744073709551615 you are seeing).
http://www.cplusplus.com/reference/string/string/find/
There are also some logic errors (you only need 1 find in the while loop) but stepping through the code with the debugger and comparing the result with the function design should enable these to be found fairly easily.
Re: Read binary file with line delimiter
Quote:
Originally Posted by
Philidor
Hello Paul,
Thanks for your reply. I'm taking your suggestions and have been trying the code below. The positions where the commas occur are fine, but I get an error (run exit value 1) when assigning the substring to V[i] (in red).
Well, one thing is that you should not assume your string is less than 10,000 characters.
Code:
while (pos2<10000) {
The std::string has a size() function that returns you the number of characters. You should be using the value of size(), and not hard-code 10,000.
Quote:
I'm putting the condition "pos2<10000" because when a value is not found I receive the value 18446744073709551615.
Always know what standard library functions will return:
http://www.cplusplus.com/reference/string/string/find/
Read the section on the return value when the string cannot be found.
Regards,
Paul McKenzie
Re: Read binary file with line delimiter
Hello 2kaud and Paul,
Thanks for your help. I was able to write a function that returns the vector elements as Paul described, using comma delimiters. I then changed the delimiter to "FF77", and the code below seems to work. The element "Test1" is not considered, since in the real file the first characters shouldn't be considered, so that part is not incorrect.
I removed one find from the loop. Maybe you can see whether the code so far has any issues or anything to improve.
Besides any issue you can see that could be improved, I have 2 problems:
1- I get exit value 1 when using the 2 lines in red to get the position of the last field separator.
2- I wanted to replace the delimiter string with a variable, but for some reason the error says that 2 parameters are expected and 3 were provided (this happens if I use the line in blue and replace "FF77" with Sep in all places).
Code:
#include <string>
#include <vector>
#include <iostream>
using namespace std;
vector<string> getFields(const string& FSepStr)
{
    int i = 0;
    size_t pos = 1;
    size_t LastFS;
    vector<string> V;
    //string Sep = "FF77";
    while (FSepStr.find("FF77",pos+1,4)!=string::npos) {
        pos = FSepStr.find("FF77",FSepStr.find("FF77",pos+1,4)-1,4);
        if (FSepStr.find("FF77",pos+1,4)!=string::npos){
            V.push_back(FSepStr.substr(pos+4, FSepStr.find("FF77",pos+1,4) - pos - 4));
        }
        i++;
    }
    return V;
}
int main()
{
    const string InputStr = "Test1FF77Test2FF77Test3FF77Some textFF77other textFF772";
    vector<string> sVector;
    sVector = getFields(InputStr);
    //size_t LastFS = InputStr.rfind("FF77");
    for (int i=0;i<=sVector.size();i++){
        cout<<"V["<<i<<"]="<<sVector[i]<<endl;
    }
    //cout <<"Last FSep: "<<LastFS<<endl;
}
Output:
Code:
V[0]=Test2
V[1]=Test3
V[2]=Some text
V[3]=other text
RUN FAILED (exit value 1, total time: 90ms)
Thanks again for the help.
Re: Read binary file with line delimiter
Code:
for (int i=0; i<=sVector.size();i++)
You are going beyond the bounds of the vector. Vectors (and arrays) in C++ start from 0 and go to n-1, where "n" is the number of elements. If that vector has 10 elements in it, you are erroneously going from 0 to 10 instead of 0 to 9. That's why you have a failure at the end of your program.
Regards,
Paul McKenzie