-
compare two files
Could anyone please help me compare two files containing lines of text? The two files hold similar data, except that one contains duplicates while the other contains only unique lines. I have to verify that the file with unique data contains every line present in the file with duplicates.
Both files look something like the sample below, and I only have to compare the lines that contain the substring CME, which appears right before the first comma in the line.
Code:
BREACH:0:40:GE:20100601-07:34:22.796
0:1:ORDER ID:0000D9DB
0:2:ORDER ID:0000D9DC
0:1:TRDR:GRC
0:2:TRDR:GRC
0:0:TRADE CROSSING IDS:14ijpol1vsu7l7,yfrea51dwd8jx
0:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
0:0:WASH-ORD-TIME-DIFF,2.703
0:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
BREACH:1:40:GE:20100601-07:34:22.796
1:1:ORDER ID:0000D9DB
1:2:ORDER ID:0000D9DC
1:1:TRDR:GRC
1:2:TRDR:GRC
1:0:TRADE CROSSING IDS:1iyucih1bcmpso,d88hmz15psx80
1:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
1:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
1:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
1:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
1:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
1:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
1:0:WASH-ORD-TIME-DIFF,2.703
1:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
BREACH:40:40:GE:20100601-14:08:35.406
40:1:ORDER ID:0000D9XN
40:2:ORDER ID:0000DBHJ
40:1:TRDR:DAF
40:2:TRDR:DAF
40:0:TRADE CROSSING IDS:4hr6iu6smidw,1t6btger8juyg
40:1:OrderReceive,01.06.2010 09:58:50.031,0323YK058,0000D9XN,A,25,0,25,0,-35
40:1:OrderReceive,01.06.2010 09:50:42.290,0323YK058,0000D9XN,A,25,0,25,0,-35
40:2:OrderReceive,01.06.2010 14:07:29.062,0323YK153,0000DBHJ,A,7,0,7,0,160
40:2:OrderReceive,01.06.2010 13:59:20.853,0323YK153,0000DBHJ,A,7,0,7,0,160
40:1:CME,20100601,12:45:46.250,DAF,GE,201012,FUT,XGDAF,0323YK058,1kxdklb1oghepj,Leg Fill,0000D9XN,B,00000,-3.5,2,99,20100601,12:38:57
40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,1v0f5tr1da5176,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,61wcagnnjmjl,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
40:1:CME,20100601,12:45:46.265,DAF,GE,201106,FUT,XGDAF,0323YK058,v6d4b2agp1hr,Leg Fill,0000D9XN,B,00000,-3.5,2,98.715,20100601,12:38:57
40:0:WASH-ORD-TIME-DIFF,14918.6
40:0:2nd-ORD-TO-WASH-TIME-DIFF,554.553
BREACH:101:30:GE:20100601-07:18:05.015
101:1:ORDER ID:0001U8QR
101:2:ORDER ID:0001W0PJ
101:1:TRDR:MTJ
101:2:TRDR:FDC
101:0:TRADE CROSSING IDS:1ua0o7twia2cx,p3mqxj1it2iao
101:1:OrderReceive,01.06.2010 07:18:05.015,082X7Y007,0001U8QR,A,1,0,1,0,99025
101:1:OrderReceive,01.06.2010 07:09:57.556,082X7Y007,0001U8QR,A,1,0,1,0,99025
101:2:OrderReceive,01.06.2010 07:18:04.468,0323X8076,0001W0PJ,A,10,0,10,0,145
101:2:OrderReceive,01.06.2010 07:09:57.009,0323X8076,0001W0PJ,A,10,0,10,0,145
101:1:CME,20100601,07:18:05.015,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
101:1:CME,20100601,07:09:57.556,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
101:2:CME,20100601,07:18:04.468,FDC,GE,201009,FUT,XGFDC,0323X8076,1t6wkc41c2ki0m,Leg Fill,0001W0PJ,S,00000,14.5,3,99.17,20100601,07:11:15
101:2:CME,20100601,07:18:04.468,FDC,GE,201012,FUT,XGFDC,0323X8076,15vxdjj1r1imja,Leg Fill,0001W0PJ,B,00000,14.5,3,99.025,20100601,07:11:15
101:0:CROSS-ORD-TIME-DIFF,0.547
-
Re: compare two files
Are these files sorted? At least the entries that you are interested in?
Is it OK for the file with unique entries to contain the stuff not in the second file?
Do entire lines have to match, or just some part of it?
-
Re: compare two files
Head over to C++ Boost (www.boost.org), then read up on Spirit. It is a parser framework for C++ that will help you create a parser to do just about anything.
-
Re: compare two files
Vladmir.
I think the entries I am interested in are sorted in the file that contains unique values, but not in the one that also contains duplicates.
The file with unique entries was actually extracted from the file with duplicates, and I want to make certain that the unique file contains all the entries from the duplicate file.
Not entire lines; just part of the line has to match, starting from GRC till the end:
Code:
GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
Many thanks
-
Re: compare two files
There are many lines in your sample that contain CME, but do not contain GRC.
-
Re: compare two files
You could use unix commands for this.
Code:
grep ':CME,' filename | sed -e 's/expression to parse out what you want/\1/' | uniq
The grep selects the lines you're interested in from the source file.
The sed keeps only the part of each line that has to be compared.
uniq deletes duplicates (note that it only removes adjacent ones, so sort the output first if it isn't already ordered).
Then use diff to find the differences.
If you want to do it in C++, you have the same steps to do, apart from the fact that writing a good diff program is not trivial (how do you know when the files re-synchronize?).
-
Re: compare two files
Philip Nicoletti, it is not GRC specifically; it is the position at which that three-letter word occurs.
-
Re: compare two files
What is the position? Consider these three lines. How do you
determine the position at which the comparison should start?
Code:
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
40:1:CME,20100601,12:45:46.250,DAF,GE,201103,FUT,XGDAF,0323YK058,61wcagnnjmjl,Leg Fill,0000D9XN,S,00000,-3.5,2,98.875,20100601,12:38:57
101:1:CME,20100601,07:18:05.015,MTJ,GE,201012,FUT,XGMTJ,082X7Y007,1ua0o7twia2cx,Fill,0001U8QR,S,00000,99.025,1,99.025,20100601,07:11:16
-
Re: compare two files
Is this EDIFACT by any chance?
Anyway, a simple sort-and-remove-duplicates algorithm should do the job; the only problem is that we aren't quite sure what your definition of a duplicate is.
Could you give us an exact description of what you consider to be equal lines?
-
Re: compare two files
Philip Nicoletti
It is the position after the third comma.
monarch_dodra
Equal lines are those whose substring before the first colon (like 0 or 40) and whose substring starting right after the 3rd comma from the beginning and running till the end of the line are exactly the same, like:
Code:
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
which in this case is 0 and the string from GRC till the end.
-
Re: compare two files
This would be my approach:
First, use this struct:
Code:
struct line
{
    std::string full_line;
    std::string sub_line;
};
bool operator<(const line& lhs, const line& rhs);
full_line is the entire original line, used only for reference.
sub_line is the substring used for comparison.
You want to implement operator< so that it compares sub_line only.
Now you can build a std::set<line>, filling it with line structs one by one. Do this for both files; this gives you two sets at the end.
Once you have done this, you can use the following algorithms to compare your sets:
set_union - union of two sorted ranges (function template)
set_intersection - intersection of two sorted ranges (function template)
set_difference - difference of two sorted ranges (function template)
set_symmetric_difference - symmetric difference of two sorted ranges (function template)
In your case, you'd want set_difference to find the lines in your first file that aren't in your unique file.
Depending on your needs, you may also consider using a multiset instead of a set.
-
Re: compare two files
Adding to monarch_dodra's advice: you can use boost::transform_iterator to fill the set<> quickly. Note that std::istream_iterator<std::string> splits on whitespace, so this only works as-is if the lines you keep contain no spaces or tabs:
Code:
struct Filter
{
line operator()( const std::string& s ) const
{
// build a line instance
}
};
typedef boost::transform_iterator< Filter, std::istream_iterator<std::string> > line_iterator;
std::set<line> first_set( line_iterator( first_file, Filter() ), line_iterator() );
std::set<line> second_set( line_iterator( second_file, Filter() ), line_iterator() );
// ...
-
Re: compare two files
Thanks monarch_dodra and superbonzo.
superbonzo, I haven't used Boost before, so I have no idea about it. I am also new to the STL :D. I am a recent graduate and started my job a month ago :P, so please do help me.
This is exactly the block of data I am looking for duplicates in:
Code:
BREACH:0:40:GE:20100601-07:34:22.796
0:1:ORDER ID:0000D9DB
0:2:ORDER ID:0000D9DC
0:1:TRDR:GRC
0:2:TRDR:GRC
0:0:TRADE CROSSING IDS:14ijpol1vsu7l7,yfrea51dwd8jx
0:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:23.359,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
0:1:CME,20100601,07:34:24.406,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
0:1:CME,20100601,07:26:15.306,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,14ijpol1vsu7l7,Fill,0000D9DB,B,00000,99.155,2,99.155,20100601,07:27:34
0:1:CME,20100601,07:26:15.322,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1iyucih1bcmpso,Fill,0000D9DB,B,00000,99.155,1,99.155,20100601,07:27:34
0:1:CME,20100601,07:26:15.884,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,xl7a2t1fwh479,Fill,0000D9DB,B,00000,99.155,17,99.155,20100601,07:27:34
0:1:CME,20100601,07:26:16.915,GRC,GE,201009,FUT,XGGRC,0G4LHZ013,1ynu8nhiqkwyz,Fill,0000D9DB,B,00000,99.155,30,99.155,20100601,07:27:35
0:2:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,d88hmz15psx80,Leg Fill,0000D9DC,S,00000,14,1,99.155,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,91218jx96avv,Leg Fill,0000D9DC,B,00000,14,1,99.015,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,6wek8owhn5su,Leg Fill,0000D9DC,S,00000,14,8,99.155,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,1csjxtcsg9su,Leg Fill,0000D9DC,B,00000,14,8,99.015,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,yfrea51dwd8jx,Leg Fill,0000D9DC,S,00000,14,2,99.155,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,vn6pl3dkevk0,Leg Fill,0000D9DC,B,00000,14,2,99.015,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,1ry0p8q124cwl,Leg Fill,0000D9DC,S,00000,14,1,99.155,20100601,07:27:34
0:2:CME,20100601,07:34:22.796,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,15n670p1xopt6q,Leg Fill,0000D9DC,B,00000,14,1,99.015,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,d88hmz15psx80,Leg Fill,0000D9DC,S,00000,14,1,99.155,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,91218jx96avv,Leg Fill,0000D9DC,B,00000,14,1,99.015,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,6wek8owhn5su,Leg Fill,0000D9DC,S,00000,14,8,99.155,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,1csjxtcsg9su,Leg Fill,0000D9DC,B,00000,14,8,99.015,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,yfrea51dwd8jx,Leg Fill,0000D9DC,S,00000,14,2,99.155,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,vn6pl3dkevk0,Leg Fill,0000D9DC,B,00000,14,2,99.015,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,1ry0p8q124cwl,Leg Fill,0000D9DC,S,00000,14,1,99.155,20100601,07:27:34
0:2:CME,20100601,07:26:15.322,GRC,GE,201012,FUT,XGGRC,0G4LHZ014,15n670p1xopt6q,Leg Fill,0000D9DC,B,00000,14,1,99.015,20100601,07:27:34
0:0:WASH-ORD-TIME-DIFF,2.703
0:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
I am not going to do anything to these lines at the beginning of the data block; they should remain exactly as they are:
Code:
BREACH:0:40:GE:20100601-07:34:22.796
0:1:ORDER ID:0000D9DB
0:2:ORDER ID:0000D9DC
0:1:TRDR:GRC
0:2:TRDR:GRC
0:0:TRADE CROSSING IDS:14ijpol1vsu7l7,yfrea51dwd8jx
0:1:OrderReceive,01.06.2010 07:34:07.875,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:1:OrderReceive,01.06.2010 07:26:00.400,0G4LHZ013,0000D9DB,A,50,0,50,0,99155
0:2:OrderReceive,01.06.2010 07:34:10.578,0G4LHZ014,0000D9DC,A,50,0,50,0,140
0:2:OrderReceive,01.06.2010 07:26:03.103,0G4LHZ014,0000D9DC,A,50,0,50,0,140
and these lines at the end should also remain the same
Code:
1:0:WASH-ORD-TIME-DIFF,2.703
1:0:2nd-ORD-TO-WASH-TIME-DIFF,499.693
I need to find the duplicates among the lines that have CME right after the second colon (:), for example:
Code:
1:2:CME,20100601,07:34:22.796,GRC,GE,201009,FUT,XGGRC,0G4LHZ014,d88hmz15psx80,Leg Fill,0000D9DC,S,00000,14,1,99.155,20100601,07:27:34
and the substring used for finding duplicates starts at the position right after the third comma (,). The string from the beginning of the line till the first should also remain the same.
Is there any way I can get to the lines having CME without changing the lines above and below them? I can then remove the duplicates afterwards, which would reduce the number of CME lines.
-
Re: compare two files
correction:
the substring used for finding duplicates starts at the position right after the third comma (,). The string from the beginning of the line till the first COMMA (,) should also remain the same.
-
Re: compare two files
The basic idea: read each file into a set<string> and compare the two sets, as previously suggested.
Here is some sample code to read a file into a set:
Code:
#include <fstream>
#include <sstream>
#include <string>
#include <set>
using namespace std;

// Add records containing "CME," to the set. This keeps the first
// field (plus the ":") and drops everything from the first colon
// up to and including the third comma of each line. The subsetted
// line is then added to the records set.
void ReadRecords(const char* fname, set<string>& records)
{
    ifstream in(fname);
    string line, first_field;
    while (getline(in, line))
    {
        if (line.find("CME,") != string::npos)
        {
            stringstream ss(line); // a stringstream eases parsing the line
            getline(ss, first_field, ':');
            const int n_commas = 3; // number of commas to skip
            for (int i = 0; i < n_commas; ++i)
            {
                getline(ss, line, ',');
            }
            getline(ss, line); // rest of the line after the third comma
            records.insert(first_field + ":" + line);
        }
    }
}
// ...
set<string> file1_records;
ReadRecords("first_file.txt",file1_records);