CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Results 1 to 10 of 10
  1. #1
    Join Date
    Jul 2009
    Posts
    40

    how to find duplicate lines in multiple files.

    Hi, I'm trying to figure out how this could work I have:

    FILE A, B, C, D
    they all have around 1000000+ lines.
    I need a code that will find duplicate lines in either files, and write to a file which file has a duplicated line in this form:

    FOUND DUPLICATE LINE >> filename1:line - filename2:line
    it can also be:
    FOUND DUPLICATE LINE >> filename1:line - filename1:line

    thanks in advance, I've tried methods but they are very slow.
    we are aiming a 40-60 minutes max procedure.

  2. #2
    Join Date
    Jan 2009
    Posts
    1,689

    Re: how to find duplicate lines in multiple files.

    4 million lines took an hour? That doesn't sound right at all. I can't think of an algorithm that will beat the brute force way of going through them all. I think the way you're checking is the problem.

    Load the entire file in first, then tokenize it via line, then do a loop to check all lines against all other lines. I don't think this should take your program more than a minute.

    My assumption is that you are using getline or something. While getline does use a buffer, if each file is only a million lines, you can certainly just read the whole thing in in one go.

  3. #3
    Join Date
    Aug 2000
    Location
    West Virginia
    Posts
    7,721

    Re: how to find duplicate lines in multiple files.

    A million reasonably size lines should only take a few seconds to process.

    If you can read one of the files into memory:

    1) create a map<string,int> , where the "string" is the line and the "int" is
    the line number in the first file.

    2) read the first file line by line, putting the data into the map

    3) read the second file line by line ... for each line use the map to see
    if it was in the first file and the line number if it was.

  4. #4
    Join Date
    Jun 2009
    Location
    France
    Posts
    2,513

    Re: how to find duplicate lines in multiple files.

    Quote Originally Posted by ninja9578 View Post
    4 million lines took an hour? That doesn't sound right at all. I can't think of an algorithm that will beat the brute force way of going through them all. I think the way you're checking is the problem.

    Load the entire file in first, then tokenize it via line, then do a loop to check all lines against all other lines. I don't think this should take your program more than a minute.

    My assumption is that you are using getline or something. While getline does use a buffer, if each file is only a million lines, you can certainly just read the whole thing in in one go.
    I agree with buffering the file, rather than get lining 1000000 times.

    However, brute force DOES take n^2 iterations, which is A LOT when n is big.

    Something really simple, that could lower the complexity to nlog(n) would be to store the lines in a (multi)set. Duplicates are SO MUCH EASIER to find in a sorted container.

    That or sort your container. Sorting is nlog(n), so even that should be better than brute force.

    You could even lower the complexity to n (but higher worst case), if you use a hash table. The advantage of using a hash table is that you could store only the hash value, and afterwards check for false positives.

    The choice is yours in trade off development time*complexity/speed of program

    Regardless of your choice though, this should NOT take 45 minutes unless you choose some really retarded algorithm. Even the simplest of algorithms shouldn't take that long.
    Is your question related to IO?
    Read this C++ FAQ article at parashift by Marshall Cline. In particular points 1-6.
    It will explain how to correctly deal with IO, how to validate input, and why you shouldn't count on "while(!in.eof())". And it always makes for excellent reading.

  5. #5
    Join Date
    Jan 2009
    Posts
    1,689

    Re: how to find duplicate lines in multiple files.

    Actually, the brute force method takes S(n).

    But okay, sorting does sound better.

  6. #6
    Join Date
    Jul 2009
    Posts
    40

    Re: how to find duplicate lines in multiple files.

    Quote Originally Posted by monarch_dodra View Post
    I agree with buffering the file, rather than get lining 1000000 times.

    However, brute force DOES take n^2 iterations, which is A LOT when n is big.

    Something really simple, that could lower the complexity to nlog(n) would be to store the lines in a (multi)set. Duplicates are SO MUCH EASIER to find in a sorted container.

    That or sort your container. Sorting is nlog(n), so even that should be better than brute force.

    You could even lower the complexity to n (but higher worst case), if you use a hash table. The advantage of using a hash table is that you could store only the hash value, and afterwards check for false positives.

    The choice is yours in trade off development time*complexity/speed of program

    Regardless of your choice though, this should NOT take 45 minutes unless you choose some really retarded algorithm. Even the simplest of algorithms shouldn't take that long.
    well can you show me a small example of what algorithm you would use?
    I did use getline() it's not reading the file that's long it's the process of finding duplicate lines from a 1 millions line file with another 1 million line. since it has to search 1m * 1m times to find the duplicates.

    I did read on hash table though not in C++ in C# that's what i was reading before I saw answers over here.

    btw each line is about 300 chars.
    Last edited by dsylebee; July 21st, 2010 at 09:26 AM.

  7. #7
    Join Date
    Aug 2000
    Location
    West Virginia
    Posts
    7,721

    Re: how to find duplicate lines in multiple files.

    Here is an (untested) example using a map. A hash table would be slightly faster,
    but most of the time will be spent in IO.

    Code:
    #include <string>
    #include <map>
    #include <fstream>
    
    void Intersection(const char * input1 , const char * input2 , const char * output)
    {
      using namespace std;
      
      map<string,int> m;
      
      ifstream in(input1);
      
      string line;
     
      int line_count = 0;
      while (getline(in,line)) 
      {
        m[line] = line_count++;
      }
      
    
      ifstream in2(input2);
      ofstream out(output);
     
      line_count = 0;
      while (getline(in2,line))
      {
         map<string,int>::iterator it = m.find(line);
         if (it != m.end())
         {
           out << it->second << " : " << line << "\n";
           ++line_count;
         }
      }
    }

  8. #8
    Join Date
    Jul 2009
    Posts
    40

    Re: how to find duplicate lines in multiple files.

    Quote Originally Posted by Philip Nicoletti View Post
    Here is an (untested) example using a map. A hash table would be slightly faster,
    but most of the time will be spent in IO.

    Code:
    #include <string>
    #include <map>
    #include <fstream>
    
    void Intersection(const char * input1 , const char * input2 , const char * output)
    {
      using namespace std;
      
      map<string,int> m;
      
      ifstream in(input1);
      
      string line;
     
      int line_count = 0;
      while (getline(in,line)) 
      {
        m[line] = line_count++;
      }
      
    
      ifstream in2(input2);
      ofstream out(output);
     
      line_count = 0;
      while (getline(in2,line))
      {
         map<string,int>::iterator it = m.find(line);
         if (it != m.end())
         {
           out << it->second << " : " << line << "\n";
           ++line_count;
         }
      }
    }
    this won't change the file I'm reading right?

  9. #9
    Join Date
    Jun 2009
    Location
    France
    Posts
    2,513

    Re: how to find duplicate lines in multiple files.

    A similiar example, albeit it doesn't do exactly the same thing (your requirements are not that specific)

    Code:
    #include <iostream>
    #include <string>
    #include <fstream>
    #include <map>
    
    using namespace std;
    
    struct lineInfo
    {
        string filename;
        int lineNumber;
    };
    
    ostream& operator<<(ostream& oOstream, const lineInfo& iLineInfo)
    {
        oOstream << iLineInfo.filename << " at line " << iLineInfo.lineNumber;
        return oOstream;
    }
    
    int main()
    {
        map<string, lineInfo> lineMap;
    
        string fileName = "test.txt";
        ifstream myFile(fileName.c_str());
    
        string line;
        int line_number = 1;
        while(getline(myFile, line))
        {
            lineInfo currentLineInfo = {fileName, line_number};
            pair<string, lineInfo> newObject(line, currentLineInfo);
            pair<map<string, lineInfo>::iterator, bool> result = lineMap.insert(newObject);
    
            if (result.second == false)
            {
                std::cout << "\"" << line << "\"" << " in " << currentLineInfo << " was already in " << (*result.first).second << "." << std::endl;
            }
            ++line_number;
        }
    
        return 0;
    }
    It's quite ugly. I should have used more typedefs and functions, but that's how it is. I tested it by the way.

    The method has some inefficiencies because it copies the string into the map every time. It could probably go faster with shallow char pointers.

    That said, I'm sure there's more to gain by finding a better algorithm, than tweaking the implementation.

    It took about 2 seconds to process 1,200,000 lines. I wouldn't bother too much making the algorithm any faster, I'd just make it clear, maintainable, and expandable.
    Is your question related to IO?
    Read this C++ FAQ article at parashift by Marshall Cline. In particular points 1-6.
    It will explain how to correctly deal with IO, how to validate input, and why you shouldn't count on "while(!in.eof())". And it always makes for excellent reading.

  10. #10
    Join Date
    Jul 2009
    Posts
    40

    Re: how to find duplicate lines in multiple files.

    Quote Originally Posted by monarch_dodra View Post
    A similiar example, albeit it doesn't do exactly the same thing (your requirements are not that specific)

    Code:
    #include <iostream>
    #include <string>
    #include <fstream>
    #include <map>
    
    using namespace std;
    
    struct lineInfo
    {
        string filename;
        int lineNumber;
    };
    
    ostream& operator<<(ostream& oOstream, const lineInfo& iLineInfo)
    {
        oOstream << iLineInfo.filename << " at line " << iLineInfo.lineNumber;
        return oOstream;
    }
    
    int main()
    {
        map<string, lineInfo> lineMap;
    
        string fileName = "test.txt";
        ifstream myFile(fileName.c_str());
    
        string line;
        int line_number = 1;
        while(getline(myFile, line))
        {
            lineInfo currentLineInfo = {fileName, line_number};
            pair<string, lineInfo> newObject(line, currentLineInfo);
            pair<map<string, lineInfo>::iterator, bool> result = lineMap.insert(newObject);
    
            if (result.second == false)
            {
                std::cout << "\"" << line << "\"" << " in " << currentLineInfo << " was already in " << (*result.first).second << "." << std::endl;
            }
            ++line_number;
        }
    
        return 0;
    }
    It's quite ugly. I should have used more typedefs and functions, but that's how it is. I tested it by the way.

    The method has some inefficiencies because it copies the string into the map every time. It could probably go faster with shallow char pointers.

    That said, I'm sure there's more to gain by finding a better algorithm, than tweaking the implementation.

    It took about 2 seconds to process 1,200,000 lines. I wouldn't bother too much making the algorithm any faster, I'd just make it clear, maintainable, and expandable.
    nvm it should work thanks ill tweak with the typedefs my self
    Last edited by dsylebee; July 21st, 2010 at 10:46 AM.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  





Click Here to Expand Forum to Full Width

Featured