-
November 1st, 2010, 09:04 AM
#1
finding duplicates for part of a string
Hello everyone
I am trying to find duplicates of part of a string, not the whole string. The strings are stored in a file. Each line of the file contains one string, and many (though not all) of the lines look something like this:
Code:
0:1:CME,20100601,14:07:53.375,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
Here the '0' at the very beginning is common throughout a block of lines. The next block has a '1' in common throughout, and so on. The part of the string from CCD until the end can be duplicated, and I have to find how many such duplicate lines there are for each '0', '1', and so on. The file can contain any combination of strings, not just the one in the example above, but if it contains duplicates, then the part from the position of the 'C' of 'CCD' to the end is what is repeated.
After I find the duplicates, I have to compare the file with another file that contains all the unique strings extracted from the first file (the one with duplicates). I want to know whether the file of non-duplicate values contains every string that appears in the first file, to make sure that all of the strings have been extracted uniquely and stored in the second file.
Can anyone please help? I would be grateful.
-
November 1st, 2010, 09:20 AM
#2
Re: finding duplicates for part of a string
Comparing a substring is easy; if the entire string is stored in a std::string s, then do
Code:
s.substr(s.find("CCD"))
and compare that to what you're looking for.
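One caveat with the one-liner above: find() returns std::string::npos when the marker is absent, and substr() throws std::out_of_range for a position past the end of the string. A minimal sketch of a guarded version follows; the helper name fromMarker is made up for illustration.

```cpp
#include <string>

// Returns the part of `s` starting at the first occurrence of `marker`,
// or an empty string when `marker` does not occur in `s`.
std::string fromMarker(const std::string& s, const std::string& marker)
{
    const std::string::size_type pos = s.find(marker);
    if (pos == std::string::npos) return std::string();
    return s.substr(pos);
}
```

Returning an empty string for "not found" is just one possible convention; a caller that needs to tell the two cases apart could return the position instead.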
-
November 1st, 2010, 09:21 AM
#3
Re: finding duplicates for part of a string
It is not clear exactly what you need. Maybe a multimap?
Example:
Code:
#include <map>
#include <string>
using namespace std;

struct Key
{
    string field1; // "0" in your example
    string field2; // "1" in your example
    // Order by field1 first, then by field2.
    bool operator < (const Key & rhs) const
    {
        if (field1 < rhs.field1) return true;
        if (field1 > rhs.field1) return false;
        return field2 < rhs.field2;
    }
};
//
multimap<Key,string> filemap;
// where the "value" part of the map is the "rest of the line" in your example
// 0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
I think more explanation is needed, with examples of lines that you would
want to flag as duplicates.
-
November 1st, 2010, 09:26 AM
#4
Re: finding duplicates for part of a string
Could you please elaborate? The CCD can be anything, so I cannot write it in quotes. The part of the string from the position of the 'C' of CCD to the end is what would be duplicated. It is the position that matters, not the text CCD. Also, if one instance of the part from CCD to the end appears on a line starting with '0' and the next instance appears on a line starting with any other digit, it is not considered a duplicate. I hope I have made my point clear.
-
November 1st, 2010, 09:29 AM
#5
Re: finding duplicates for part of a string
Below are some more lines from the file
Code:
0:1:CME,20100601,14:07:53.375,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
0:1:CME,20100601,13:59:45.556,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1oilw3g1bvmoyo,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
0:1:CME,20100601,13:59:45.150,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1oyr7uiubtx0l,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
0:1:CME,20100601,13:59:45.165,CCD,GE,201009,FUT,XGCCD,0G4L7D294,v09wp1112gneo,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
1:2:CME,20100601,10:22:33.078,CFD,GE,201106,FUT,XGCFD,0G4LGP101,5o5wv61n8ds4w,Leg Fill,0000DA3F,S,00000,18.5,2,98.675,20100601,10:15:44
1:2:CME,20100601,10:14:25.275,CFD,GE,201106,FUT,XGCFD,0G4LGP101,7ga0hh1psbfa5,Leg Fill,0000DA3F,S,00000,18.5,2,98.675,20100601,10:15:44
1:2:CME,20100601,10:22:33.078,CFD,GE,201106,FUT,XGCFD,0G4LGP101,a46o111s2tk45,Leg Fill,0000DA3F,S,00000,18.5,3,98.675,20100601,10:15:44
1:2:CME,20100601,10:22:33.046,CFD,GE,201106,FUT,XGCFD,0G4LGP101,k13xp1tfotis,Leg Fill,0000DA3F,S,00000,18.5,4,98.675,20100601,10:15:44
50:1:CME,20100601,12:07:24.384,DCE,GE,201109,FUT,XGDCE,0G4LCN103,mge46sg0pe1k,Fill,0001VVCH,S,00000,98.48,1,98.48,20100601,12:08:43
50:2:CME,20100601,12:07:24.384,GVM,GE,201109,FUT,XGGVM,0G4L9J144,14xceud1jgrquv,Leg Fill,0001W2UK,B,00000,24,2,98.48,20100601,12:08:43
50:2:CME,20100601,12:15:32.390,GVM,GE,201109,FUT,XGGVM,0G4L9J144,1hchm0pwgw0ip,Leg Fill,0001W2UK,B,00000,24,6,98.48,20100601,12:08:43
50:2:CME,20100601,12:07:24.415,GVM,GE,201109,FUT,XGGVM,0G4L9J144,1igm2qn1djr0ry,Leg Fill,0001W2UK,B,00000,24,1,98.48,20100601,12:08:43
-
November 1st, 2010, 09:36 AM
#6
Re: finding duplicates for part of a string
1) substr() takes a position, so that is not a problem. You just need to make
sure that the line actually contains enough characters.
2) So it looks like a simple map<string,string> will work,
where the "key" is the rest of the line and the "value" is the "0" field.
-
November 1st, 2010, 09:37 AM
#7
Re: finding duplicates for part of a string
Originally Posted by heidiK
Could you please elaborate? The CCD can be anything, so I cannot write it in quotes.
The point was to specify the point in the line where the repeated segment begins. Perhaps you need to parse the line looking for delimiters in order to identify that point instead?
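A delimiter-based lookup could be sketched as follows. It assumes the repeated segment begins right after the third comma, which is where CCD, CFD, DCE and GVM sit in the sample lines; that choice and the helper name suffixAfterNthComma are illustrative, not something confirmed in the thread.

```cpp
#include <cstddef>
#include <string>

// Returns the part of `line` that follows the n-th comma, or an empty
// string if the line has fewer than n commas.
std::string suffixAfterNthComma(const std::string& line, std::size_t n)
{
    std::string::size_type pos = 0;
    for (std::size_t i = 0; i < n; ++i) {
        pos = line.find(',', pos);
        if (pos == std::string::npos) return std::string();
        ++pos; // step past the comma just found
    }
    return line.substr(pos);
}
```

Counting commas instead of characters means the result does not shift when the leading ID has more digits, as in the lines that start with 50.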
-
November 1st, 2010, 09:39 AM
#8
Re: finding duplicates for part of a string
Thanks, Philips. Could you please write me an example? I am totally new to the STL, and map is an STL container.
-
November 1st, 2010, 09:41 AM
#9
Re: finding duplicates for part of a string
Lindley, the repeated part starts at position 31 of the string; from there onwards, the text can appear more than once in the file within a single block. A block can start with any digit (stored as a string), but no two blocks can have the same starting digit (or line ID).
-
November 1st, 2010, 10:03 AM
#10
Re: finding duplicates for part of a string
Originally Posted by heidiK
Lindley, the repeated part starts at position 31 of the string; from there onwards, the text can appear more than once in the file within a single block. A block can start with any digit (stored as a string), but no two blocks can have the same starting digit (or line ID).
Oh, well, if you know the position in the string where the common portion starts, then just use that directly. Put it in a const variable, though: "magic numbers" like 31 should be limited as much as possible in your code.
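A sketch of the named-constant idea follows. One thing to watch is that std::string positions are 0-based: in the first sample line the 'C' of CCD is the 31st character, which is index 30. The names kCommonStart and commonPart are made up here, and the value 30 is an assumption to check against the real file.

```cpp
#include <string>

// 0-based offset of the repeated part. The thread gives the position as 31;
// counted from 1, that is index 30 in the first sample line. Treat this value
// as an assumption to verify against the real data.
const std::string::size_type kCommonStart = 30;

// Returns the repeated part of `line`, or an empty string if the line is
// too short to contain it.
std::string commonPart(const std::string& line)
{
    if (line.size() <= kCommonStart) return std::string();
    return line.substr(kCommonStart);
}
```

A fixed offset like this only holds when every line has a prefix of the same width.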
-
November 1st, 2010, 10:09 AM
#11
Re: finding duplicates for part of a string
Is it always 31? It seems like it would depend on the number of characters
in the first (and maybe second) field.
Example from your post:
Code:
1:2:CME,20100601,10:22:33.046,CFD,GE,201106,FUT,XGCFD,0G4LGP101,k13xp1tfotis,Leg Fill,0000DA3F,S,00000,18.5,4,98.675,20100601,10:15:44
50:1:CME,20100601,12:07:24.384,DCE,GE,201109,FUT,XGDCE,0G4LCN103,mge46sg0pe1k,Fill,0001VVCH,S,00000,98.48,1,98.48,20100601,12:08:43
What is the start of the code?
-
November 1st, 2010, 10:18 AM
#12
Re: finding duplicates for part of a string
I have not started writing code for this part yet. So far I have only been able to extract the duplicates from the actual file based on some condition and store all the unique strings in a different file. Now I want to compare the two files: the original file with duplicates and the new file with unique strings. Could you please help me with a small piece of code?