  1. #1
    Join Date
    Oct 2010
    Posts
    106

    finding duplicates for part of a string

    Hello everyone

    I am trying to find duplicates of part of a string, not the whole string. The strings are stored in a file, one per line, and many of the lines (though not all) look something like this:

    Code:
    0:1:CME,20100601,14:07:53.375,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    where the '0' at the very beginning is common to a whole block of lines. The next block will have a '1' in common, and so on. The part of the string from CCD to the end can be duplicated, and I have to find how many such duplicate lines there are for each '0', '1' and so on. The file can contain any combination of strings, not just the one in the example above, but if it contains duplicates then it is the part from the position of the 'C' of 'CCD' to the end that is repeated.

    After I find the duplicates, I have to compare them with another file, which contains all the unique strings extracted from the first file (the one with duplicates). I want to know whether the file of unique values contains every string that appears in the first file. In other words, I want to make sure that all of the strings have been extracted uniquely and stored in the second file.

    Can anyone please help? I would be grateful.

  2. #2
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: finding duplicates for part of a string

    Comparing a substring is easy; if the entire string is stored in a std::string s, then do
    Code:
    s.substr(s.find("CCD"))
    and compare that to what you're looking for.
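
    One caveat: if the marker is not in the line, find() returns std::string::npos and substr() will throw std::out_of_range. A minimal sketch of a guarded version follows; the function name tail_from_marker is made up for illustration, and "CCD" stands in for whatever marker applies.

```cpp
#include <string>

// Returns the part of `line` starting at the first occurrence of `marker`,
// or an empty string when the marker does not occur in the line.
// (Hypothetical helper name, not from any library.)
std::string tail_from_marker(const std::string& line, const std::string& marker)
{
    const std::string::size_type pos = line.find(marker);
    if (pos == std::string::npos)
        return std::string();   // avoid calling substr(npos), which throws
    return line.substr(pos);
}
```

    Two lines from the same block can then be compared with tail_from_marker(a, "CCD") == tail_from_marker(b, "CCD").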

  3. #3
    Join Date
    Aug 2000
    Location
    West Virginia
    Posts
    7,721

    Re: finding duplicates for part of a string

    It is not clear exactly what you need. Maybe a multimap?

    Example:

    Code:
    #include <map>
    #include <string>
    using namespace std;
    
    struct Key
    {
       string field1;  // "0" in your example
       string field2;  // "1" in your example
    
       // order by field1 first, then by field2
       bool operator < (const Key & rhs) const
       {
          if (field1 < rhs.field1) return true;
          if (field1 > rhs.field1) return false;
    
          return field2 < rhs.field2;
       }
    };
    
    //
    
    multimap<Key,string> filemap;
    
    // where the "value" part of the map is the "rest of the line" in your example
    // 0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    I think more explanation is needed ... with examples of lines that you would
    want to flag as duplicates.

  4. #4
    Join Date
    Oct 2010
    Posts
    106

    Re: finding duplicates for part of a string

    Could you please elaborate? The CCD can be anything, so I cannot write it in quotes. The part of the string from the position of the 'C' of CCD to the end would be duplicated. It is the position that matters, not the text CCD. And if one instance of that part appears in a line starting with '0' and the next instance appears in a line that starts with any digit other than '0', then it would not be considered a duplicate. I hope I have made my point clear.

  5. #5
    Join Date
    Oct 2010
    Posts
    106

    Re: finding duplicates for part of a string

    Below are some more lines from the file

    Code:
    0:1:CME,20100601,14:07:53.375,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1ig53ov1n1qm3z,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    0:1:CME,20100601,13:59:45.556,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1oilw3g1bvmoyo,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    0:1:CME,20100601,13:59:45.150,CCD,GE,201009,FUT,XGCCD,0G4L7D294,1oyr7uiubtx0l,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    0:1:CME,20100601,13:59:45.165,CCD,GE,201009,FUT,XGCCD,0G4L7D294,v09wp1112gneo,Leg Fill,00006L3W,S,00000,2,1,99.175,20100601,14:01:04
    
    1:2:CME,20100601,10:22:33.078,CFD,GE,201106,FUT,XGCFD,0G4LGP101,5o5wv61n8ds4w,Leg Fill,0000DA3F,S,00000,18.5,2,98.675,20100601,10:15:44
    1:2:CME,20100601,10:14:25.275,CFD,GE,201106,FUT,XGCFD,0G4LGP101,7ga0hh1psbfa5,Leg Fill,0000DA3F,S,00000,18.5,2,98.675,20100601,10:15:44
    1:2:CME,20100601,10:22:33.078,CFD,GE,201106,FUT,XGCFD,0G4LGP101,a46o111s2tk45,Leg Fill,0000DA3F,S,00000,18.5,3,98.675,20100601,10:15:44
    1:2:CME,20100601,10:22:33.046,CFD,GE,201106,FUT,XGCFD,0G4LGP101,k13xp1tfotis,Leg Fill,0000DA3F,S,00000,18.5,4,98.675,20100601,10:15:44
    
    50:1:CME,20100601,12:07:24.384,DCE,GE,201109,FUT,XGDCE,0G4LCN103,mge46sg0pe1k,Fill,0001VVCH,S,00000,98.48,1,98.48,20100601,12:08:43
    50:2:CME,20100601,12:07:24.384,GVM,GE,201109,FUT,XGGVM,0G4L9J144,14xceud1jgrquv,Leg Fill,0001W2UK,B,00000,24,2,98.48,20100601,12:08:43
    50:2:CME,20100601,12:15:32.390,GVM,GE,201109,FUT,XGGVM,0G4L9J144,1hchm0pwgw0ip,Leg Fill,0001W2UK,B,00000,24,6,98.48,20100601,12:08:43
    50:2:CME,20100601,12:07:24.415,GVM,GE,201109,FUT,XGGVM,0G4L9J144,1igm2qn1djr0ry,Leg Fill,0001W2UK,B,00000,24,1,98.48,20100601,12:08:43

  6. #6
    Join Date
    Aug 2000
    Location
    West Virginia
    Posts
    7,721

    Re: finding duplicates for part of a string

    1) substr() takes a position ... so that is not a problem ... you just need to make
    sure that the line actually contains enough characters.

    2) so it looks like a simple map<string,string> will work

    where the "key" is the rest of the line and the "value" is the "0" field.
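
    If the goal is to count how often each line tail repeats inside a block, one variation is to key the map on both the block id and the tail and store a count as the value. This is only a sketch under that assumption. Record and count_occurrences are made-up names, and the (block id, tail) pairs are assumed to have been split out of each line already.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// (block id, repeated part of the line), e.g. ("0", "CCD,GE,201009,...").
// Hypothetical type name, for illustration only.
typedef std::pair<std::string, std::string> Record;

// Counts how many times each (block id, tail) pair occurs.
// The same tail under two different block ids gives two separate keys,
// so it is not counted as a duplicate across blocks.
std::map<Record, int> count_occurrences(const std::vector<Record>& records)
{
    std::map<Record, int> counts;
    for (const Record& r : records)
        ++counts[r];   // operator[] creates the entry with 0 on first use
    return counts;
}
```

    Any entry whose count is greater than 1 is a duplicated tail within that block.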

  7. #7
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: finding duplicates for part of a string

    Quote Originally Posted by heidiK View Post
    could you please elaborate. The CCD can be anything. I cannot write it in quotes.
    The point was to specify the point in the line where the repeated segment begins. Perhaps you need to parse the line looking for delimiters in order to identify that point instead?
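
    As a rough sketch of the delimiter idea: in the sample lines the repeated segment appears to start right after the third comma, so one could count commas instead of naming the field. That "third comma" rule is an assumption read off the posted lines, and offset_after_commas is a made-up helper name.

```cpp
#include <string>

// Returns the index just past the `count`-th comma in `line`,
// or std::string::npos if the line has fewer commas than that.
// (Hypothetical helper, not part of the standard library.)
std::string::size_type offset_after_commas(const std::string& line, int count)
{
    std::string::size_type pos = 0;
    for (int i = 0; i < count; ++i)
    {
        pos = line.find(',', pos);
        if (pos == std::string::npos)
            return std::string::npos;
        ++pos;   // step past the comma just found
    }
    return pos;
}
```

    With count = 3 this gives index 30 for the lines starting "0:1:" and index 31 for the lines starting "50:1:", so the offset follows the length of the leading fields. Passing the result to substr() (after checking for npos) yields the segment to compare.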

  8. #8
    Join Date
    Oct 2010
    Posts
    106

    Re: finding duplicates for part of a string

    Thanks Philips. Could you please write me an example? I am totally new to the STL, and map is an STL container.

  9. #9
    Join Date
    Oct 2010
    Posts
    106

    Re: finding duplicates for part of a string

    Lindley! The position is 31 in the string; from there onwards, a string can appear more than once in the file within a single block. A block can start with any digit (stored as a string), but no two blocks can have the same starting digit (or line ID).

  10. #10
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: finding duplicates for part of a string

    Quote Originally Posted by heidiK View Post
    Lindley! the position is 31 in the string from where onwards a string can appear more than once in the file within a single block. The block can start from any digit (stored as string) but no two blocks can have a same starting digit (or line ID)
    Oh, well, if you know the position in the string where the common portion starts, then just use that directly. Put it in a const variable, though: "magic numbers" like 31 should be limited as much as possible in your code.
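
    Something along these lines, as a sketch only. The name kTailOffset and the value 30 are assumptions: zero-based index 30 is the 31st character, which is where "CCD" starts in the posted lines beginning "0:1:". Adjust it if your 31 is already zero-based.

```cpp
#include <string>

// Assumed zero-based offset of the repeated segment (hypothetical name and value).
const std::string::size_type kTailOffset = 30;

// Returns the line from kTailOffset onwards, or an empty string when the
// line is too short to contain that position.
std::string tail_at_fixed_offset(const std::string& line)
{
    if (line.size() <= kTailOffset)
        return std::string();
    return line.substr(kTailOffset);
}
```

    The length check matters because substr() throws std::out_of_range when the position is past the end of the string.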

  11. #11
    Join Date
    Aug 2000
    Location
    West Virginia
    Posts
    7,721

    Re: finding duplicates for part of a string

    Is it always 31 ? It seems like it would depend on the number of characters
    in the first (and maybe second) field.

    example from your post

    Code:
    1:2:CME,20100601,10:22:33.046,CFD,GE,201106,FUT,XGCFD,0G4LGP101,k13xp1tfotis,Leg Fill,0000DA3F,S,00000,18.5,4,98.675,20100601,10:15:44
    
    50:1:CME,20100601,12:07:24.384,DCE,GE,201109,FUT,XGDCE,0G4LCN103,mge46sg0pe1k,Fill,0001VVCH,S,00000,98.48,1,98.48,20100601,12:08:43
    What is the start of the code ?

  12. #12
    Join Date
    Oct 2010
    Posts
    106

    Re: finding duplicates for part of a string

    I have not started writing code for this part yet. I have just been able to extract the duplicates from the actual file based on some condition, and finally stored all the unique strings in a different file. Now I want to compare the two files: the original file with duplicates and the new file with unique strings. Could you please help me with a small piece of code?
