dcsimg
CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Results 1 to 5 of 5

Thread: detecting duplicates in a vector<string>

  1. #1
    Join Date
    May 2009
    Location
    Boston
    Posts
    350

    detecting duplicates in a vector<string>

    Hello,

    Sorry for having multiple threads going but this is bugging me. I have a little function to check some input data for duplicates and it's not working.

    Code:
    // class for main data container
    class row_data {
    
    public:
    
       // class constructor
       row_data() : id(0), name("")  { }
    
       unsigned int id;
       string name;
    
    };
    
    // check row object name values for duplicates
    bool check_for_duplicate_names(const vector<row_data>& data) {
    
       // collect list of row names to check for duplicates
       vector<string> list; for(unsigned int i=0; i<data.size(); i++) { list.push_back(data[i].name); }
    
       // sort the string vector
       sort( name_list.begin(), name_list.end() );
    
       // run unique to locate duplicate names
       // moves duplicates to the end and returns iterator to location of first duplicate
       vector<string>::iterator unique_check = unique(name_list.begin(), name_list.end());
    
    // if no duplicates were detected, iterator location will be the end of the container
    return( unique_check == name_list.end() );
    }
    
     // program entry point
    int main() {
    
       // main data container
       vector<row_data> data;
    
       // data is populated
    
       // check if any data.name is a duplicate
       if( check_for_duplicate_names(data) ) {
          cout << "all row names are unique" << endl;
          cout << endl;
       }
       else if( !check_for_duplicate_names(data) ) {
          cout << "all row names are not unique" << endl;
          cout << endl;
       }
    
    return(0);
    }
    The function always returns false as if duplicates were found, no matter if the input file has duplicates or not. My understanding of unique() is that you need to sort the vector first and then unique() moves duplicate values to the end of the vector. Unique() returns an iterator to the location of the first element that was moved, meaning the first position after the last unique element. If the iterator value is equal to .end(), then nothing was moved so there were no duplicates.

    Is that not correct?

    If I print the underlying data from inside the function,
    Code:
    // check row object name values for duplicates
    bool check_for_duplicate_names(const vector<row_data>& data) {
    
       // collect list of row names to check for duplicates
       vector<string> list; for(unsigned int i=0; i<data.size(); i++) { list.push_back(data[i].name); }
    
       // sort the string vector
       sort( name_list.begin(), name_list.end() );
    
       // run unique to locate duplicate names
       // moves duplicates to the end and returns iterator to location of first duplicate
       vector<string>::iterator unique_check = unique(name_list.begin(), name_list.end());
    
       cout << "first duplicate row name" << endl;
       cout << *(unique_check) << endl;
       cout << endl;
    
    // if no duplicates were detected, iterator location will be the end of the container
    return( unique_check == name_list.end() );
    }
    nothing is printed,

    Code:
    first duplicate row name
    suggesting that ( unique_check == name_list.end() ) should evaluate true, so why does does it always return false? Do I need a custom comparison operator or something like that?

    I could do this by inserting into a map, and that might be the better algorithm anyway but I think that above should work and it's bugging me that it doesn't. If I just have the logic backwards, it still shouldn't give the same result for both cases. I have checked that the input files are actually different and I have printed out the number of names and name_list to make sure that it was actually populated.

    Am I missing something obvious here or do I misunderstand what unique() is supposed to do?

    LMHmedchem
    Last edited by LMHmedchem; August 22nd, 2018 at 03:07 PM.

  2. #2
    2kaud's Avatar
    2kaud is online now Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    6,512

    Re: detecting duplicates in a vector<string>

    Am I missing something obvious here
    From the code in post #1, yes! (look at the variable names). Also, why do a double check? If the if-condition is not true, then it must be not unique! Consider this

    Code:
    // check row object name values for duplicates
    bool check_for_duplicate_names(const vector<row_data>& data) {
    
       // collect list of row names to check for duplicates
       vector<string> name_list; for(unsigned int i=0; i<data.size(); i++) { name_list.push_back(data[i].name); }
    
       // sort the string vector
       sort( name_list.begin(), name_list.end() );
    
       // run unique to locate duplicate names
       // moves duplicates to the end and returns iterator to location of first duplicate
       vector<string>::iterator unique_check = unique(name_list.begin(), name_list.end());
    
    // if no duplicates were detected, iterator location will be the end of the container
    return( unique_check == name_list.end() );
    }
    
     // program entry point
    int main() {
    
       // main data container
       vector<row_data> data;
    
       // data is populated
    
       // check if any data.name is a duplicate
       if( check_for_duplicate_names(data) ) {
          cout << "all row names are unique" << endl;
          cout << endl;
       }
       else {
          cout << "all row names are not unique" << endl;
          cout << endl;
       }
    
    return(0);
    }

    On my VS2017 system

    Code:
    #include <string>
    #include <vector>
    #include <iostream>
    #include <algorithm>
    using namespace std;
    
    class row_data {
    
    public:
    
    	// class constructor
    	row_data() : id(0), name("") { }
    	row_data(const string& s) : name(s), id(0) {}
    
    	unsigned int id;
    	string name;
    };
    
    
    // check row object name values for duplicates
    bool check_for_duplicate_names(const vector<row_data>& data) {
    
    	// collect list of row names to check for duplicates
    	vector<string> name_list; for (unsigned int i = 0; i < data.size(); i++) { name_list.push_back(data[i].name); }
    
    	// sort the string vector
    	sort(name_list.begin(), name_list.end());
    
    	// run unique to locate duplicate names
    	// moves duplicates to the end and returns iterator to location of first duplicate
    	vector<string>::iterator unique_check = unique(name_list.begin(), name_list.end());
    
    	// if no duplicates were detected, iterator location will be the end of the container
    	return(unique_check == name_list.end());
    }
    
    // program entry point
    int main() {
    
    	// main data container
    	vector<row_data> data1 = {{"a"}, {"b"}, {"c"}, {"a"}};
    	vector<row_data> data2 = {{"a"}, {"b"}, {"c"}};
    
    	// data is populated
    
    	// check if any data.name is a duplicate
    	if (check_for_duplicate_names(data1)) {
    		cout << "data1 all row names are unique" << endl;
    		cout << endl;
    	} else {
    		cout << "data1 all row names are not unique" << endl;
    		cout << endl;
    	}
    
    	if (check_for_duplicate_names(data2)) {
    		cout << "data2 all row names are unique" << endl;
    		cout << endl;
    	} else {
    		cout << "data2 all row names are not unique" << endl;
    		cout << endl;
    	}
    
    	return(0);
    }
    produces the expected outout

    Code:
    data1 all row names are not unique
    
    data2 all row names are unique
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++17 Compiler: Microsoft VS2017 (15.9.4)

  3. #3
    2kaud's Avatar
    2kaud is online now Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    6,512

    Re: detecting duplicates in a vector<string>

    could do this by inserting into a map, and that might be the better algorithm anyway
    You would use a set. Consider

    Code:
    // check row object name values for duplicates
    bool check_for_duplicate_names(const vector<row_data>& data)
    {
    	set<string> name_list;
    
    	for (unsigned int i = 0; i < data.size(); i++)
    		if (name_list.insert(data[i].name).second == false)
    			return false;
    
    	return true;
    }
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++17 Compiler: Microsoft VS2017 (15.9.4)

  4. #4
    Join Date
    May 2009
    Location
    Boston
    Posts
    350

    Re: detecting duplicates in a vector<string>

    The problem in my post was a typo. The real problem was that I checked the data container before the names were it in. I printed name_list.size() and it was correct, but all the name values were the "" value that the objects were initialized to. The names were all duplicates, which was obvious when I actually printed the names from inside the function.

    The vector of row_data objects had been populated but all the input values were still in the string vector where they are stored on file read. The values like data.name are converted from string to type in another function that hadn't been called yet. This what happens when you go a couple of years without looking at the code.

    Quote Originally Posted by 2kaud View Post
    You would use a set. Consider

    Code:
    // check row object name values for duplicates
    bool check_for_duplicate_names(const vector<row_data>& data)
    {
        set<string> name_list;
    
        for (unsigned int i = 0; i < data.size(); i++)
            if (name_list.insert(data[i].name).second == false)
                return false;
    
        return true;
    }
    That is pretty efficient since it quits as soon as a duplicate is found. With the unique() solution, you have to copy all the names to the vector and sort before you find out if there are any duplicates. I guess I would use a map if I wanted to keep and output all of the duplicate names. I could probably do that with the unique() solution as well.

    What is the value of set.insert over map.insert?

    LMHmedchem
    Last edited by LMHmedchem; August 22nd, 2018 at 08:16 PM.

  5. #5
    2kaud's Avatar
    2kaud is online now Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    6,512

    Re: detecting duplicates in a vector<string>

    What is the value of set.insert over map.insert?
    map has a key and a value. Unless you want to know for any name how many duplicates there were, you don't need a value. You just need to know if the key already exists or not. Hence a set.

    Also, as you don't need to have the key's sorted you could also use an unsorted set which would probably be quicker. set is tree-based and unordered_set is hash based.

    Code:
    #include <unordered_set>
    ...
    // check row object name values for duplicates
    bool check_for_duplicate_names(const vector<row_data>& data) {
    	unordered_set<string> name_list;
    
    	for (unsigned int i = 0; i < data.size(); i++)
    		if (name_list.insert(data[i].name).second == false)
    			return false;
    
    	return true;
    }

    However, the best way of doing things probably depends upon what you actually are trying to achieve. eg why test the vector for duplicates after it has been populated? Why not test for duplicates as it is being populated if they are not allowed?
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++17 Compiler: Microsoft VS2017 (15.9.4)

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  


Windows Mobile Development Center


Click Here to Expand Forum to Full Width




On-Demand Webinars (sponsored)