-
compare two files
Could anyone please help me compare two files containing lines of text? The two files hold similar data, except that one contains duplicates while the other contains only unique lines. I have to verify that the file with unique data contains every line present in the file with duplicates.
Both files look something like the sample below, and I only have to compare the lines that contain the substring CME, which appears right before the first comma in the line.
Code:
BREACH:0:40:GE:20100601-07:34:22.796
0:1:ORDER ID:0000D9DB
0:2:ORDER ID:0000D9DC
0:1:TRDR:GRC
0:2:TRDR:GRC
0:0:TRADE CROSSING IDS:14ijpol1vsu7l7,yfrea51dwd8jx
0:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
0:0:WASH-ORD-TIME-DIFF,2.703
0:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
BREACH:1:40:GE:20100601-07:34:22.796
1:1:ORDER ID:0000D9DB
1:2:ORDER ID:0000D9DC
1:1:TRDR:GRC
1:2:TRDR:GRC
1:0:TRADE CROSSING IDS:1iyucih1bcmpso,d88hmz15psx80
1:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
1:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
1:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
1:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
1:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
1:0:WASH-ORD-TIME-DIFF,2.703
1:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
BREACH:40:40:GE:20100601-14:08:35.406
40:1:ORDER ID:0000D9XN
40:2:ORDER ID:0000DBHJ
40:1:TRDR:DAF
40:2:TRDR:DAF
40:0:TRADE CROSSING IDS:4hr6iu6smidw,1t6btger8juyg
40:1:OrderReceive,01.06.2010 09:58:50.031,0323YK058,0000D9XN,A,25,0,25,0,-35
40:1:OrderReceive,01.06.2010 09:50:42.290,0323YK058,0000D9XN,A,25,0,25,0,-35
40:2:OrderReceive,01.06.2010 14:07:29.062,0323YK153,0000DBHJ,A,7,0,7,0,160
40:2:OrderReceive,01.06.2010 13:59:20.853,0323YK153,0000DBHJ,A,7,0,7,0,160
40:1:CME,20100601,12:45:46.250,DAF,GE,201012,FUT,XGDAF,0323YK058,1kxdklb1oghepj,Leg Fill,0000D9XN,B,00000,-3.5,2,99,20100601,12:38:57
40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,1v0f5tr1da5176,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,61wcagnnjmjl,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
40:1:CME,20100601,12:45:46.265,DAF,GE,201106,FUT,XGDAF,0323YK058,v6d4b2agp1hr,Leg Fill,0000D9XN,B,00000,-3.5,2,98.715,20100601,12:38:57
40:0:WASH-ORD-TIME-DIFF,14918.6
40:0:2nd-ORD-TO-WASH-TIME-DIFF,554.553
BREACH:101:30:GE:20100601-07:18:05.015
101:1:ORDER ID:0001U8QR
101:2:ORDER ID:0001W0PJ
101:1:TRDR:MTJ
101:2:TRDR:FDC
101:0:TRADE CROSSING IDS:1ua0o7twia2cx,p3mqxj1it2iao
101:1:OrderReceive,01.06.2010 07:18:05.015,082X7Y007,0001U8QR,A,1,0,1,0,99025
101:1:OrderReceive,01.06.2010 07:09:57.556,082X7Y007,0001U8QR,A,1,0,1,0,99025
101:2:OrderReceive,01.06.2010 07:18:04.468,0323X8076,0001W0PJ,A,10,0,10,0,145
101:2:OrderReceive,01.06.2010 07:09:57.009,0323X8076,0001W0PJ,A,10,0,10,0,145
101:1:CME,20100601,07:18:05.015,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
101:1:CME,20100601,07:09:57.556,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
101:2:CME,20100601,07:18:04.468,FDC,GE,201009,FUT,XGFDC,0323X8076,1t6wkc41c2ki0m,Leg Fill,0001W0PJ,S,00000,14.5,3,99.17,20100601,07:11:15
101:2:CME,20100601,07:18:04.468,FDC,GE,201012,FUT,XGFDC,0323X8076,15vxdjj1r1imja,Leg Fill,0001W0PJ,B,00000,14.5,3,99.025,20100601,07:11:15
101:0:CROSS-ORD-TIME-DIFF,0.547
-
Re: compare two files
Are these files sorted? At least the entries that you are interested in?
Is it OK for the file with unique entries to contain the stuff not in the second file?
Do entire lines have to match, or just some part of it?
-
Re: compare two files
Head over to C++ Boost (www.boost.org), then read up on Spirit. It is a parser framework for C++ that will help you create a parser to do just about anything.
-
Re: compare two files
Vladmir.
I think the entries I am interested in are sorted in the file that contains unique values, but not in the one that also contains duplicates.
The file with unique entries was actually extracted from the file with duplicates, and I want to make certain that the unique file contains all the entries from the duplicate file.
Not entire lines; just part of the line has to match, starting from GRC till the end:
Code:
GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
Many thanks
-
Re: compare two files
There are many lines in your sample that contain CME, but do not contain GRC.
-
Re: compare two files
You could use unix commands for this.
Code:
grep ':CME,' filename | sed -e 's/expression to parse out what you want/\1/' | uniq
The grep selects the lines you're interested in from the source file.
The sed keeps only the part of each line that has to be compared.
uniq deletes duplicates (note that it only removes adjacent ones, so sort the output first if it isn't already ordered).
Then use diff to find the differences.
If you want to do it in C++, you have the same steps to do, apart from the fact that writing a good diff program is not trivial (how do you know when the files re-synchronize?).
-
Re: compare two files
Philip Nicoletti, it is not GRC specifically; it is the position at which that three-letter word occurs.
-
Re: compare two files
What is the position? Consider these three lines. How do you
determine the position at which the comparison should start?
Code:
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,61wcagnnjmjl,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
101:1:CME,20100601,07:18:05.015,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
-
Re: compare two files
Is this EDIFACT by any chance?
Anyway, a simple sort-and-remove-duplicates algorithm should do the job; the only problem is that we aren't quite sure what your definition of a duplicate is.
Could you give us an exact description of what you consider to be equal lines?
-
Re: compare two files
Philip Nicoletti
It is the position after the third comma.
monarch_dodra
Equal lines are those whose substring before the first colon (like 0 or 40) and whose substring starting right after the 3rd comma from the beginning and running till the end of the line are exactly the same, like:
Code:
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
which in this case is 0 and the string from GRC till the end.
-
Re: compare two files
This would be my approach:
First, use this struct:
Code:
struct line
{
    std::string full_line;
    std::string sub_line;
};
bool operator<(const line& lhs, const line& rhs);
full_line is the entire original line, used only for reference.
sub_line is the substring used for comparison.
You want to implement operator< so that it compares sub_line only.
Now you can build a std::set<line>, filling it with line structs one by one. Do this for both files; this gives you two sets at the end.
Once you have done this, you can use the following algorithms to compare your sets:
set_union - union of two sorted ranges (function template)
set_intersection - intersection of two sorted ranges (function template)
set_difference - difference of two sorted ranges (function template)
set_symmetric_difference - symmetric difference of two sorted ranges (function template)
In your case, you'd want set_difference to find the lines in your first file that aren't in your unique file.
Depending on your needs, you may also consider using a multiset instead of a set.
-
Re: compare two files
Adding to monarch_dodra's advice: you can use boost::transform_iterator to fill the set<> quickly. Note that std::istream_iterator<std::string> splits on whitespace, so this only works as-is if the lines you keep contain no spaces or tabs:
Code:
struct Filter
{
line operator()( const std::string& s ) const
{
// build a line instance
}
};
typedef boost::transform_iterator< Filter, std::istream_iterator<std::string> > line_iterator;
std::set<line> first_set( line_iterator( first_file, Filter() ), line_iterator() );
std::set<line> second_set( line_iterator( second_file, Filter() ), line_iterator() );
// ...
-
Re: compare two files
Thanks monarch_dodra and superbonzo.
superbonzo, I haven't used Boost before, so I have no idea about it. I am also new to the STL :D. I am a recent graduate and started my job a month ago :P, so please do help me.
This is exactly the block of data I am looking for duplicates in:
Code:
BREACH:0:40:GE:20100601-07:34:22.796
0:1:ORDER ID:0000D9DB
0:2:ORDER ID:0000D9DC
0:1:TRDR:GRC
0:2:TRDR:GRC
0:0:TRADE CROSSING IDS:14ijpol1vsu7l7,yfrea51dwd8jx
0:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
0:1:CME,20100601,07:26:15.306,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
0:1:CME,20100601,07:26:15.322,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
0:1:CME,20100601,07:26:15.884,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
0:1:CME,20100601,07:26:16.915,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
0:2:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,d88hmz15psx80,Leg Fill,0000D9DC,S,00000,14,1,99.155,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,91218jx96avv,Leg Fill,0000D9DC,B,00000,14,1,99.015,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,6wek8owhn5su,Leg Fill,0000D9DC,S,00000,14,8,99.155,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,1csjxtcsg9su,Leg Fill,0000D9DC,B,00000,14,8,99.015,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,yfrea51dwd8jx,Leg Fill,0000D9DC,S,00000,14,2,99.155,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,vn6pl3dkevk0,Leg Fill,0000D9DC,B,00000,14,2,99.015,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,1ry0p8q124cwl,Leg Fill,0000D9DC,S,00000,14,1,99.155,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,15n670p1xopt6q,Leg Fill,0000D9DC,B,00000,14,1,99.015,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,d88hmz15psx80,Leg Fill,0000D9DC,S,00000,14,1,99.155,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,91218jx96avv,Leg Fill,0000D9DC,B,00000,14,1,99.015,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,6wek8owhn5su,Leg Fill,0000D9DC,S,00000,14,8,99.155,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,1csjxtcsg9su,Leg Fill,0000D9DC,B,00000,14,8,99.015,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,yfrea51dwd8jx,Leg Fill,0000D9DC,S,00000,14,2,99.155,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,vn6pl3dkevk0,Leg Fill,0000D9DC,B,00000,14,2,99.015,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,1ry0p8q124cwl,Leg Fill,0000D9DC,S,00000,14,1,99.155,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,15n670p1xopt6q,Leg Fill,0000D9DC,B,00000,14,1,99.015,20100601,07:27:34
0:0:WASH-ORD-TIME-DIFF,2.703
0:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
I am not going to do anything to these lines at the beginning of the data block; they should remain exactly as they are:
Code:
BREACH:0:40:GE:20100601-07:34:22.796
0:1:ORDER ID:0000D9DB
0:2:ORDER ID:0000D9DC
0:1:TRDR:GRC
0:2:TRDR:GRC
0:0:TRADE CROSSING IDS:14ijpol1vsu7l7,yfrea51dwd8jx
0:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
and these lines at the end should also remain the same
Code:
1:0:WASH-ORD-TIME-DIFF,2.703
1:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
I need to find the duplicates among the lines that have CME right after the second colon (:), for example:
Code:
1:2:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,d88hmz15psx80,Leg Fill,0000D9DC,S,00000,14,1,99.155,20100601,07:27:34
and the substring used for finding duplicates starts at the position right after the third comma (,). The string from the beginning of the line till the first should also remain the same.
Is there any way I can get to the lines having CME without changing the lines above and below them? I can then remove the duplicates afterwards, which would reduce the number of CME lines.
-
Re: compare two files
correction:
the substring used for finding duplicates starts at the position right after the third comma (,). The string from the beginning of the line till the first COMMA (,) should also remain the same.
-
Re: compare two files
The basic idea: read each file into a set<string> and compare the two sets, as previously suggested.
Here is some sample code to read a file into a set:
Code:
#include <fstream>
#include <sstream>
#include <string>
#include <set>
using namespace std;

// Add records containing "CME," to the set. This keeps the first
// field (plus the ":") and drops everything from the first colon
// up to and including the third comma of each line. The subsetted
// line is then added to the records set.
void ReadRecords(const char* fname, set<string>& records)
{
    ifstream in(fname);
    string line, first_field;
    while (getline(in, line))
    {
        if (line.find("CME,") != string::npos)
        {
            stringstream ss(line); // a stringstream eases parsing the line
            getline(ss, first_field, ':');
            const int n_commas = 3; // number of commas to skip
            for (int i = 0; i < n_commas; ++i)
            {
                getline(ss, line, ',');
            }
            getline(ss, line); // rest of the line after the third comma
            records.insert(first_field + ":" + line);
        }
    }
}
// ...
set<string> file1_records;
ReadRecords("first_file.txt",file1_records);