question about sorting on bitstrings
CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Page 1 of 2 12 LastLast
Results 1 to 15 of 17

Thread: question about sorting on bitstrings

Hybrid View

  1. #1
    Join Date
    May 2009
    Location
    Boston
    Posts
    254

    question about sorting on bitstrings

    Hello,

    I have a problem I am working on where I need to sort some data based on the values of a string of bits. The strings look like this,

    010000001110000000

    there are 18 bits, 1 means a feature is present, 0 means the feature is absent.

    Each of these string has 4 on bits. I need to sort them such that I have the longest possible runs with 3 of the same on bits. It doesn't matter which 3 bits are on, I am just looking to order them in blocks with the longest possible runs. As a second step, the ordered blocks will be sorted by size large>small.

    The following data is ordered like I need it to be.
    Code:
    // block 1, run of 12, keys 1,2,11 are identical (key 12 is also identical)
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    
    // block 2, run of 10, keys 1,2 ,15 are identical
    011000000000001100
    011000000000001100
    011000000000001100
    011000000000001100
    011000000000001100
    011000000000001100
    
    011000000000001010
    011000000000001010
    011000000000001010
    011000000000001010
    
    // block 3, run of 8, keys 8 ,9,10 are identical
    010000001110000000
    010000001110000000
    010000001110000000
    
    000000001110010000
    000000001110010000
    
    000000001110000010
    
    100000001110000000
    
    000100001110000000
    This is the sort order that I am looking for. I need to be able to take a list of the bit strings in any particular order and sort them into the order above. The algorithm would need to recognize that there are 4 on keys and then look for groupings of three common on keys.

    This is more of an algorithm question than one about specific implementation in code. I generally assume that most programming problems have been solved one way or another, so I am looking for information about this kind of problem. I don't know much about analyzing and manipulating strings of bits, so suggestions would be helpful.

    Is there a standard method for this kind of pattern recognition?

    LMHmedchem

  2. #2
    Join Date
    Jul 2005
    Location
    Netherlands
    Posts
    2,012

    Re: question about sorting on bitstrings

    Quote Originally Posted by LMHmedchem View Post
    Each of these string has 4 on bits. I need to sort them such that I have the longest possible runs with 3 of the same on bits. It doesn't matter which 3 bits are on, I am just looking to order them in blocks with the longest possible runs. As a second step, the ordered blocks will be sorted by size large>small.
    That doesn't seem like an exact definition of the solution. You'll need to define formally what you mean with "longest possible runs". Is a solution with three runs of length 8, 5, 3 better or worse than 7, 6, 4?
    Cheers, D Drmmr

    Please put [code][/code] tags around your code to preserve indentation and make it more readable.

    As long as man ascribes to himself what is merely a posibility, he will not work for the attainment of it. - P. D. Ouspensky

  3. #3
    Join Date
    May 2009
    Location
    Boston
    Posts
    254

    Re: question about sorting on bitstrings

    Yes, I am having trouble with the definition, partly because this is something I am still try to figure out. I'm not exactly sure what the best outcome really is, so I have to start with a best guess. To answer your question, I would say that 8,5,3 would be better, but that is a speculation that I will have to test.

    My best guess is that the best method would be to find the longest single run of 3 and assign all of those strings to a block. That would be 8 in the case that you described. Then you would look only in the remaining strings for the longest run of 3 that you can find. I think that would make sense from the programming point of view as well because you could have a single function that would look for the longest run of x length and you could call the functions as many times as you need to assign everything.

    Does that make sense?

    LMHmedchem

  4. #4
    Join Date
    Jul 2005
    Location
    Netherlands
    Posts
    2,012

    Re: question about sorting on bitstrings

    Quote Originally Posted by LMHmedchem View Post
    Yes, I am having trouble with the definition, partly because this is something I am still try to figure out. I'm not exactly sure what the best outcome really is, so I have to start with a best guess. To answer your question, I would say that 8,5,3 would be better, but that is a speculation that I will have to test.

    My best guess is that the best method would be to find the longest single run of 3 and assign all of those strings to a block. That would be 8 in the case that you described. Then you would look only in the remaining strings for the longest run of 3 that you can find. I think that would make sense from the programming point of view as well because you could have a single function that would look for the longest run of x length and you could call the functions as many times as you need to assign everything.

    Does that make sense?
    Yes, it's a possible way to tackle the problem. You can model the problem as a longest path problem by representing each (unique) input string as a node in the graph and adding edges between any pair of inputs that share 3 'on' bits.

    An alternative approach would be to try to minimize the total number of runs. But it would really depend on the context whether that makes sense or not.
    Cheers, D Drmmr

    Please put [code][/code] tags around your code to preserve indentation and make it more readable.

    As long as man ascribes to himself what is merely a posibility, he will not work for the attainment of it. - P. D. Ouspensky

  5. #5
    Join Date
    Dec 2012
    Location
    England
    Posts
    2,329

    Re: question about sorting on bitstrings

    I'm not sure from where you are getting your data, but one possible way of doing this which may be of interest is
    Code:
    #include <iostream>
    #include <string>
    #include <cmath>
    #include <vector>
    using namespace std;
    
    typedef unsigned int UINT;
    
    typedef vector<UINT> vint;
    
    struct sORDERED {
    	UINT cnt;
    	vint vdata;
    
    	sORDERED() : cnt(0) {
    		vdata.empty();
    	}
    };
    
    const UINT ordsize = 245760;
    
    const UINT strsize = 18;
    
    string data[] = {
    "011000000001100000",
    "011000000000001010",
    "011000000000001010",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000000001100",
    "011000000000001100",
    "000000001110010000",
    "000000001110010000",
    "000000001110000010",
    "011000000000001100",
    "011000000000001100",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000000001100",
    "011000000000001100",
    "011000000000001010",
    "011000000000001010",
    "010000001110000000",
    "010000001110000000",
    "010000001110000000",
    "100000001110000000",
    "000100001110000000",
    };
    
    const UINT comb[][3] = {{0, 1, 2}, {0, 1, 3}, {0, 2, 3}, {1, 2 ,3}};
    
    int main()
    {
    const UINT NoData = sizeof(data) / sizeof(data[0]);
    const UINT NoComb = sizeof(comb) / (sizeof(comb[0][0]) * 3);
    
    sORDERED* order = new sORDERED[ordsize];
    
    bool *del = new bool[NoData];
    
    	for (UINT d = 0; d < NoData; ++d) {
    		UINT ones[4] = {0};
    		const string& st = data[d];
    
    		if (st.length() != strsize) {
    			cout << "Bad format " << st << endl;
    			continue;
    		}
    
    		for (UINT s = 0, o = 0 ; s < strsize; ++s)
    			if (st[s] == '1') 
    				ones[o++] = strsize - s - 1;
    
    		for (UINT o = 0; o < NoComb; ++o) {
    			UINT index = 0;
    			for (UINT c = 0; c < 3; ++c)
    				index += (UINT)pow(2.0, (int)ones[comb[o][c]]);
    
    			if (index >= ordsize) {
    				cout << "bad index! " << index << endl;
    			} else {
    				order[index].cnt++;
    				order[index].vdata.push_back(d);
    			}
    		}
    	}
    
    bool found;
    
    	do {
    		UINT max = 0;
    		UINT mind = 0;
    		found = false;
    		for (int d = 0; d < ordsize; d++) {
    			if (order[d].cnt > 0) {
    				found = true;
    				if (order[d].cnt > max) {
    					max = order[d].cnt;
    					mind = d;
    				}
    			}
    		}
    
    		if (found) {
    			bool out = false;
    			for (UINT v = 0; v < order[mind].vdata.size(); ++v) {
    				if (del[order[mind].vdata[v]] == 0) {
    					cout << data[order[mind].vdata[v]] << endl;
    					out = true;
    					del[order[mind].vdata[v]] = true;
    				}
    			}
    			order[mind].cnt = 0;
    
    			if (out)
    				cout << endl;
    		}
    	} while (found);
    
    	delete [] order;
    	delete [] del;
    
    	return 0;
    }
    This produces the output
    Code:
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    
    011000000000001010
    011000000000001010
    011000000000001100
    011000000000001100
    011000000000001100
    011000000000001100
    011000000000001100
    011000000000001100
    011000000000001010
    011000000000001010
    
    000000001110010000
    000000001110010000
    000000001110000010
    010000001110000000
    010000001110000000
    010000001110000000
    100000001110000000
    000100001110000000
    which I think is what you're after. There may well be more efficient ways of doing this, but this is the 'quick and dirty' way that first occured to me.
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

  6. #6
    Join Date
    May 2009
    Location
    Boston
    Posts
    254

    Re: question about sorting on bitstrings

    Thanks,

    I compiled the code you posted and ran it with some additional data. This is the code I compiled,
    Code:
    #include <iostream>
    #include <string>
    #include <cmath>
    #include <vector>
    using namespace std;
    
    typedef unsigned int UINT;
    
    typedef vector<UINT> vint;
    
    struct sORDERED {
    	UINT cnt;
    	vint vdata;
    
    	sORDERED() : cnt(0) {
    		vdata.empty();
    	}
    };
    
    const UINT ordsize = 245760;
    
    const UINT strsize = 18;
    
    string data[] = {
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001010",
    "011000000000001010",
    "011000000000001010",
    "011000000000001010",
    "011000000000001010",
    "010000001110000000",
    "010000001110000000",
    "010000001110000000",
    "000000001110010000",
    "000000001110010000",
    "000000001110000010",
    "000000001110000010",
    "100000001110000000",
    "000100001110000000",
    "000000001110001000",
    "000000011110000000",
    "000000001110000001",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "010000000000001110",
    "010000000000001110",
    "010000000000001110",
    "010000000000001110",
    "010000000000001110",
    "000001100001100000",
    "000001100001100000",
    "000001100001100000",
    "000001100001100000",
    "000001100000001010",
    "010100000000001001",
    "010100000000001001",
    "010000000001100010",
    "010000000001100010",
    "100000000000011100",
    "100000000000011100",
    "011001100000000000",
    "111001000000000000",
    "110100000000000010",
    "110001000000000010",
    "011000000000010010",
    "010000001100000010",
    "110000000000001100",
    "010000000000011100",
    "100110000000000010",
    "000110000000001010",
    "000110000000001100",
    "100111000000000000",
    "100101100000000000",
    "100001001100000000",
    "000000010000001110",
    "000100000000011010",
    "000000001100001010",
    "000000001100010010",
    "000001000001101000",
    "100001000001100000",
    "000000010001101000",
    "000000000001101100"
    };
    
    const UINT comb[][3] = {{0, 1, 2}, {0, 1, 3}, {0, 2, 3}, {1, 2 ,3}};
    
    int main()
    {
    const UINT NoData = sizeof(data) / sizeof(data[0]);
    const UINT NoComb = sizeof(comb) / (sizeof(comb[0][0]) * 3);
    
    sORDERED* order = new sORDERED[ordsize];
    
    bool *del = new bool[NoData];
    
    	for (UINT d = 0; d < NoData; ++d) {
    		UINT ones[4] = {0};
    		const string& st = data[d];
    
    		if (st.length() != strsize) {
    			cout << "Bad format " << st << endl;
    			continue;
    		}
    
    		for (UINT s = 0, o = 0 ; s < strsize; ++s)
    			if (st[s] == '1') 
    				ones[o++] = strsize - s - 1;
    
    		for (UINT o = 0; o < NoComb; ++o) {
    			UINT index = 0;
    			for (UINT c = 0; c < 3; ++c)
    				index += (UINT)pow(2.0, (int)ones[comb[o][c]]);
    
    			if (index >= ordsize) {
    				cout << "bad index! " << index << endl;
    			} else {
    				order[index].cnt++;
    				order[index].vdata.push_back(d);
    			}
    		}
    	}
    
    bool found;
    
    	do {
    		UINT max = 0;
    		UINT mind = 0;
    		found = false;
    		for (int d = 0; d < ordsize; d++) {
    			if (order[d].cnt > 0) {
    				found = true;
    				if (order[d].cnt > max) {
    					max = order[d].cnt;
    					mind = d;
    				}
    			}
    		}
    
    		if (found) {
    			bool out = false;
    			for (UINT v = 0; v < order[mind].vdata.size(); ++v) {
    				if (del[order[mind].vdata[v]] == 0) {
    					cout << data[order[mind].vdata[v]] << endl;
    					out = true;
    					del[order[mind].vdata[v]] = true;
    				}
    			}
    			order[mind].cnt = 0;
    
    			if (out)
    				cout << endl;
    		}
    	} while (found);
    
    	delete [] order;
    	delete [] del;
    
    	return 0;
    }
    This seems to work with a few caveats. This is the annotated output
    Code:
    // run of 28, 1,11,12
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    011000000001100000
    010000000001100010
    010000000001100010
    
    // run of 14, 1,14,15
    011000000000001100
    011000000000001100
    011000000000001100
    011000000000001100
    011000000000001100
    011000000000001100
    011000000000001100
    010000000000001110
    010000000000001110
    010000000000001110
    010000000000001110
    010000000000001110
    110000000000001100
    010000000000011100
    
    // run of 12, 8,9,10
    010000001110000000
    010000001110000000
    010000001110000000
    000000001110010000
    000000001110010000
    000000001110000010
    000000001110000010
    100000001110000000
    000100001110000000
    000000001110001000
    000000011110000000
    000000001110000001
    
    // run of 5, 1,2,14
    011000000000001010
    011000000000001010
    011000000000001010
    011000000000001010
    011000000000001010
    
    // run of 8, 1,2,14
    000110000001100000
    000110000001100000
    000110000001100000
    000110000001100000
    000110000001100000
    000110000001100000
    000110000001100000
    000110000001100000
    
    // single
    000000010000001110
    
    // run of 6, 5,11,12
    000001100001100000
    000001100001100000
    000001100001100000
    000001100001100000
    000001000001101000
    100001000001100000
    
    // single
    011000000000010010
    
    // run of 3, 8,9,16
    010000001100000010
    000000001100001010
    000000001100010010
    
    // run of 2, 0,13,14
    100000000000011100
    100000000000011100
    
    // run of 2, 11,12,14
    000000010001101000
    000000000001101100
    
    // run of 2, 1,3,14
    010100000000001001
    010100000000001001
    
    // run of 2, 3,14,16
    000110000000001010
    000100000000011010
    
    // single
    100110000000000010
    
    // single
    000110000000001100
    
    // run of 2, 1,2,5
    011001100000000000
    
    111001000000000000
    
    // single
    100001001100000000
    
    // single
    110100000000000010
    
    // run of 2, 0,3,5
    100111000000000000
    100101100000000000
    
    // single
    110001000000000010
    
    // single
    000001100000001010
    The first three blocks are output in order of size with the largest first. This is what I need. There does not appear to be any size order after that. It will take me a bit to work through the code, but I can probably figure that part out.

    The other thing I will need to do is to look for sets of different sizes. For example, there are several strings in this data that simply do not have 3 bits in common with any of the others. In that case, I would want to repeat the process looking for 2 common bits. I will also have strings with 5 on bits where I want to find runs of 4 matching bits. I think that the simplest thing would be to have a general algorithm where the size of the common bit groupings would be a variable. If that were the case, I could use the same code to look for any size subsets.

    Does that make any sense? Let me know.

    Also, looking through the singles that were found, there are at least 2,

    110100000000000010
    110001000000000010

    that were not found as a 3 bit set.

    To answer your question about the source of the data, this is spreadsheet like tab delimited data and there will be entire rows that need to be sorted along with the bit strings. I usually use a columns class for manipulating data like that, so I will have to shoehorn this into a larger program to read in a text file, pick out the bitkey column, sort the data, and then write the sorted output to a file. I will start putting that together tomorrow since the IO stuff is pretty boilerplate.

    Thanks again for the input.

    LMHmedchem
    Last edited by LMHmedchem; April 2nd, 2014 at 12:04 AM.

  7. #7
    Join Date
    Dec 2012
    Location
    England
    Posts
    2,329

    Re: question about sorting on bitstrings

    Also, looking through the singles that were found, there are at least 2,

    110100000000000010
    110001000000000010

    that were not found as a 3 bit set.
    Changing
    Code:
    for (int d = 0; d < ordsize; d++) {
    to
    Code:
    for (int d = ordsize - 1; d >= 0; d--) {
    should correct that.
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

  8. #8
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    3,845

    Re: question about sorting on bitstrings

    This looks like it's a variant of a fitting test...

    what happens if you find a long run, but some of the items in that run then also appear in another run (with 4 on bits, and a test for 3, every string will be part of 4 possible blocks). Does it's value count in all 4 or only in one of the 4

    Do you just want a greedy algorithm that finds the longest, then the longest of what remains, and the longest of what remains then...

    or do you want an algorithm that will fidn the best distribution or the one that will result in the fewest amount of blocks in total ?

  9. #9
    Join Date
    Jul 2013
    Posts
    256

    Re: question about sorting on bitstrings

    Quote Originally Posted by LMHmedchem View Post
    It doesn't matter which 3 bits are on, I am just looking to order them in blocks with the longest possible runs. As a second step, the ordered blocks will be sorted by size large>small.
    The binominal coefficient 18 over 3 tells how many different ways 3 bits can be set among 18 bits. It's 816.

    So one solution is to generate all 816 3-bit combinations and count which of these are present in the largest number of 18-bit strings. That's the biggest possible run. Then the strings of this run are removed and everything is repeated again without them. That gives the second biggest run, etcetera. This is a greedy algorithm for what (although somewhat relaxed) basically is a set cover problem so it's only approximately optimal.

    Generating all 3-bit subsets of an 18-bit set (or more generally all K sets of an N set) can be done using an interesting method called Gosper's hack. It works on ints interpreted as bitsets.

    This approach is feasible if the binominal coefficint isn't too big, but 816 is no problem. 9 bits among 18 gives a binominal of 48,620 and then it maybe starts to take time. That on the other hand is the worst it gets for an 18 bit set. Basically it's row 18 in Pascal's trangle.

    I've coded a solution using VS 2013. The runs are generated and then sorted before printing. The strings are interpreted as integer bitpatterns and sorted as integers from largest to smallest before output. But this of just an example and it can easily be changed to something else.

    Code:
    #include <list>
    #include <vector>
    #include <string>
    #include <bitset>
    
    const int strsize = 18;
    
    std::vector<std::string> data = {
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001010",
    "011000000000001010",
    "011000000000001010",
    "011000000000001010",
    "011000000000001010",
    "010000001110000000",
    "010000001110000000",
    "010000001110000000",
    "000000001110010000",
    "000000001110010000",
    "000000001110000010",
    "000000001110000010",
    "100000001110000000",
    "000100001110000000",
    "000000001110001000",
    "000000011110000000",
    "000000001110000001",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "010000000000001110",
    "010000000000001110",
    "010000000000001110",
    "010000000000001110",
    "010000000000001110",
    "000001100001100000",
    "000001100001100000",
    "000001100001100000",
    "000001100001100000",
    "000001100000001010",
    "010100000000001001",
    "010100000000001001",
    "010000000001100010",
    "010000000001100010",
    "100000000000011100",
    "100000000000011100",
    "011001100000000000",
    "111001000000000000",
    "110100000000000010",
    "110001000000000010",
    "011000000000010010",
    "010000001100000010",
    "110000000000001100",
    "010000000000011100",
    "100110000000000010",
    "000110000000001010",
    "000110000000001100",
    "100111000000000000",
    "100101100000000000",
    "100001001100000000",
    "000000010000001110",
    "000100000000011010",
    "000000001100001010",
    "000000001100010010",
    "000001000001101000",
    "100001000001100000",
    "000000010001101000",
    "000000000001101100",
    "000000000000000000",
    "000000000001100000"
    };
    
    
    inline int next(int x) { // next K subset of an N bitset (gosper's hack (hakmem 175))
    	int s = x & -x;
    	int r = x + s;
    	return r | (((x^r)>>2)/s);
    }
    
    inline int start(int k) { // first subset
    	return (1<<k) - 1;
    }
    
    inline int limit(int n) { // first non-subset
    	return 1<<n;
    }
    
    int to_set(const std::string& s) { // string to bitset
    	return int(std::bitset<strsize>(s).to_ulong());
    }
    
    std::string to_string(int x) { // bitset to string
    	return std::bitset<strsize>(x).to_string();
    }
    
    void run() {
    
    	const int N = strsize; // size of user sets (string length)
    	const int K = 3; // size of subset
    
    	std::list<int> usersets; // all user sets converted to bitsets
    	for (std::string s : data) usersets.push_back(to_set(s));
    	
    	std::vector<int> counts; // one counter per subset
    	int b = 0; // number of K subsets of an N set (the N over K binominal)
    	for (int s=start(K); s<limit(N); s=next(s)) ++b;
    	std::cout << "*** binominal = " << b << std::endl;
    	counts.resize(b);
    
    	bool quit = false;
    	do {
    		for (int& c : counts) c = 0; // reset counters
    
    			// count all user sets for every subset 
    		for (int s=start(K), j=0; s<limit(N); s=next(s), ++j) {
    			for (int set : usersets) {
    				if ((set & s) == s) ++counts[j];
    			}
    		}
    
    			// find max counter (biggest run)
    		int runSet = 0;
    		for (int s=start(K), j=0, m=0; s<limit(N); s=next(s), ++j) {
    			if (counts[j] > m) {
    				m = counts[j];
    				runSet = s;
    			}
    		}
    
    			// print all user sets belonging to this run
    		if (runSet > 0) {
    				// gather output sets
    			std::vector<int> out;
    			for (auto sp = usersets.begin(); sp != usersets.end();) {
    				if ((*sp & runSet) == runSet) {
    					out.push_back(*sp);
    					sp = usersets.erase(sp); // remove printed user set
    				} else ++sp;
    			}
    
    				//actual output
    			std::sort(out.rbegin(),out.rend());  // from largest to smallest
    			std::cout << to_string(runSet) << " *" << std::endl;
    			for (int x : out) {
    				std::cout << to_string(x) << " : " << x << std::endl;
    			}
    		} else quit = true; // no more runs
    		
    		std::cout << std::endl;
    
    	} while (!quit);
    
    		// print all remaining user sets
    	if (usersets.size() > 0) {
    		std::cout << "*** remaining sets:" << std::endl;
    		for (int x : usersets) std::cout << to_string(x) << std::endl;
    	}
    }
    Last edited by razzle; April 3rd, 2014 at 12:43 AM.

  10. #10
    Join Date
    May 2009
    Location
    Boston
    Posts
    254

    Re: question about sorting on bitstrings

    Thanks to everyone for all the thoughtful replies so far. I am compiling the code posted by razzle and having some issues. I don't have VS, so I am compiling with gnu. I don't think that the first issue I am running into has anything to do with that.

    Very simply, the following code will not compile,
    Code:
    #include <iostream>
    #include <string>
    #include <vector>
    
    using namespace std;
    
    std::vector<std::string> data = { "011000000001100000", "011000000001100000" };
    I am getting an error,
    error: ‘data’ does not name a type

    I know that not every compiler supports C++11 initializer lists, so I have tried separating the declaration from the assignment and tried to assign by some different methods, like .assign and .push_back.
    Code:
    #include <iostream>
    #include <string>
    #include <vector>
    
    using namespace std;
    
    std::vector<std::string> data;
    
    data[] = { "011000000001100000", "011000000001100000" };
     
    // or
    data.assign( "011000000001100000", "011000000001100000"  );
    
    // or
    data.push_back("011000000001100000");
    data.push_back("011000000001100000");
    None of that makes a difference and the compiler doesn't seem to be recognizing the declaration of data as a vec of string.

    I always seem to get stuck on the simple things.

    LMHmedchem
    Last edited by LMHmedchem; April 2nd, 2014 at 12:18 PM.

  11. #11
    Join Date
    Dec 2012
    Location
    England
    Posts
    2,329

    Re: question about sorting on bitstrings

    Based upon Razzle's code from his post #9. This will now compile with MSVC 2012 so hopefully it will compile and run with your compiler.
    Code:
    #include <cstdlib>
    #include <list>
    #include <vector>
    #include <string>
    #include <bitset>
    #include <iostream>
    #include <algorithm>
    
    const int strsize = 18;
    
    //std::vector<std::string> data = {
    std::string data[] = {
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000001100000",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001100",
    "011000000000001010",
    "011000000000001010",
    "011000000000001010",
    "011000000000001010",
    "011000000000001010",
    "010000001110000000",
    "010000001110000000",
    "010000001110000000",
    "000000001110010000",
    "000000001110010000",
    "000000001110000010",
    "000000001110000010",
    "100000001110000000",
    "000100001110000000",
    "000000001110001000",
    "000000011110000000",
    "000000001110000001",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "000110000001100000",
    "010000000000001110",
    "010000000000001110",
    "010000000000001110",
    "010000000000001110",
    "010000000000001110",
    "000001100001100000",
    "000001100001100000",
    "000001100001100000",
    "000001100001100000",
    "000001100000001010",
    "010100000000001001",
    "010100000000001001",
    "010000000001100010",
    "010000000001100010",
    "100000000000011100",
    "100000000000011100",
    "011001100000000000",
    "111001000000000000",
    "110100000000000010",
    "110001000000000010",
    "011000000000010010",
    "010000001100000010",
    "110000000000001100",
    "010000000000011100",
    "100110000000000010",
    "000110000000001010",
    "000110000000001100",
    "100111000000000000",
    "100101100000000000",
    "100001001100000000",
    "000000010000001110",
    "000100000000011010",
    "000000001100001010",
    "000000001100010010",
    "000001000001101000",
    "100001000001100000",
    "000000010001101000",
    "000000000001101100",
    "000000000000000000",
    "000000000001100000"
    };
    
    
    inline int next(int x) { // next K subset of an N bitset (gosper's hack (hakmem 175))
    	int s = x & -x;
    	int r = x + s;
    	return r | (((x^r)>>2)/s);
    }
    
    inline int start(int k) { // first subset
    	return (1<<k) - 1;
    }
    
    inline int limit(int n) { // first non-subset
    	return 1<<n;
    }
    
    int to_set(const std::string& s) { // string to bitset
    	return int(std::bitset<strsize>(s).to_ulong());
    }
    
    std::string to_string(int x) { // bitset to string
    	return std::bitset<strsize>(x).to_string();
    }
    
    int main() {
    
    	const int N = strsize; // size of user sets (string length)
    	const int K = 3; // size of subset
    
    	std::list<int> usersets; // all user sets converted to bitsets
    	for (std::string s : data) usersets.push_back(to_set(s));
    	
    	std::vector<int> counts; // one counter per subset
    	int b = 0; // number of K subsets of an N set (the N over K binominal)
    	for (int s=start(K); s<limit(N); s=next(s)) ++b;
    	std::cout << "*** binominal = " << b << std::endl;
    	counts.resize(b);
    
    	bool quit = false;
    	do {
    		for (int& c : counts) c = 0; // reset counters
    
    			// count all user sets for every subset 
    		for (int s=start(K), j=0; s<limit(N); s=next(s), ++j) {
    			for (int set : usersets) {
    				if ((set & s) == s) ++counts[j];
    			}
    		}
    
    			// find max counter (biggest run)
    		int runSet = 0;
    		for (int s=start(K), j=0, m=0; s<limit(N); s=next(s), ++j) {
    			if (counts[j] > m) {
    				m = counts[j];
    				runSet = s;
    			}
    		}
    
    			// print all user sets belonging to this run
    		if (runSet > 0) {
    				// gather output sets
    			std::vector<int> out;
    			for (auto sp = usersets.begin(); sp != usersets.end();) {
    				if ((*sp & runSet) == runSet) {
    					out.push_back(*sp);
    					sp = usersets.erase(sp); // remove printed user set
    				} else ++sp;
    			}
    
    				//actual output
    			std::sort(out.rbegin(),out.rend());  // from largest to smallest
    			std::cout << to_string(runSet) << " *" << std::endl;
    			for (int x : out) {
    				std::cout << to_string(x) << " : " << x << std::endl;
    			}
    		} else quit = true; // no more runs
    		
    		std::cout << std::endl;
    
    	} while (!quit);
    
    		// print all remaining user sets
    	if (usersets.size() > 0) {
    		std::cout << "*** remaining sets:" << std::endl;
    		for (int x : usersets) std::cout << to_string(x) << std::endl;
    	}
    }
    producing the output
    Code:
    *** binominal = 816
    010000000001100000 *
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    010000000001100010 : 65634
    010000000001100010 : 65634
    
    010000000000001100 *
    110000000000001100 : 196620
    011000000000001100 : 98316
    011000000000001100 : 98316
    011000000000001100 : 98316
    011000000000001100 : 98316
    011000000000001100 : 98316
    011000000000001100 : 98316
    011000000000001100 : 98316
    010000000000011100 : 65564
    010000000000001110 : 65550
    010000000000001110 : 65550
    010000000000001110 : 65550
    010000000000001110 : 65550
    010000000000001110 : 65550
    
    000000001110000000 *
    100000001110000000 : 131968
    010000001110000000 : 66432
    010000001110000000 : 66432
    010000001110000000 : 66432
    000100001110000000 : 17280
    000000011110000000 : 1920
    000000001110010000 : 912
    000000001110010000 : 912
    000000001110001000 : 904
    000000001110000010 : 898
    000000001110000010 : 898
    000000001110000001 : 897
    
    000010000001100000 *
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    
    000001000001100000 *
    100001000001100000 : 135264
    000001100001100000 : 6240
    000001100001100000 : 6240
    000001100001100000 : 6240
    000001100001100000 : 6240
    000001000001101000 : 4200
    
    011000000000000010 *
    011000000000010010 : 98322
    011000000000001010 : 98314
    011000000000001010 : 98314
    011000000000001010 : 98314
    011000000000001010 : 98314
    011000000000001010 : 98314
    
    000000001100000010 *
    010000001100000010 : 66306
    000000001100010010 : 786
    000000001100001010 : 778
    
    000000000000011100 *
    100000000000011100 : 1311000
    100000000000011100 : 1311000
    
    000000000001101000 *
    000000010001101000 : 1128
    000000000001101100 : 108
    
    000100000000001001 *
    010100000000001001 : 81929
    010100000000001001 : 81929
    
    000100000000001010 *
    000110000000001010 : 24586
    000100000000011010 : 16410
    
    011001000000000000 *
    111001000000000000 : 233472
    011001100000000000 : 104448
    
    100100000000000010 *
    110100000000000010 : 212994
    100110000000000010 : 155650
    
    100101000000000000 *
    100111000000000000 : 159744
    100101100000000000 : 153600
    
    000000000000001110 *
    000000010000001110 : 1038
    
    000000100000001010 *
    000001100000001010 : 6154
    
    000001001100000000 *
    100001001100000000 : 135936
    
    000010000000001100 *
    000110000000001100 : 24588
    
    010001000000000010 *
    110001000000000010 : 200706
    
    
    *** remaining sets:
    000000000000000000
    000000000001100000
    Last edited by 2kaud; April 2nd, 2014 at 04:43 PM. Reason: Corrected output from post #13
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

  12. #12
    Join Date
    Jul 2013
    Posts
    256

    Re: question about sorting on bitstrings

    Quote Originally Posted by 2kaud View Post
    producing the output
    Thank you for rewriting it in VS 2012.

    The string ending with a * is the 3-bit subset that served as template for the strings collected in each run.

    The numbers after : are the strings interpreted as integers. I printed them since the output is sorted based on them. It seems integers longer than 5 digits have been truncated in your output. Here's the correct output.

    Code:
    *** binominal = 816
    010000000001100000 *
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    011000000001100000 : 98400
    010000000001100010 : 65634
    010000000001100010 : 65634
    
    010000000000001100 *
    110000000000001100 : 196620
    011000000000001100 : 98316
    011000000000001100 : 98316
    011000000000001100 : 98316
    011000000000001100 : 98316
    011000000000001100 : 98316
    011000000000001100 : 98316
    011000000000001100 : 98316
    010000000000011100 : 65564
    010000000000001110 : 65550
    010000000000001110 : 65550
    010000000000001110 : 65550
    010000000000001110 : 65550
    010000000000001110 : 65550
    
    000000001110000000 *
    100000001110000000 : 131968
    010000001110000000 : 66432
    010000001110000000 : 66432
    010000001110000000 : 66432
    000100001110000000 : 17280
    000000011110000000 : 1920
    000000001110010000 : 912
    000000001110010000 : 912
    000000001110001000 : 904
    000000001110000010 : 898
    000000001110000010 : 898
    000000001110000001 : 897
    
    000010000001100000 *
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    000110000001100000 : 24672
    
    000001000001100000 *
    100001000001100000 : 135264
    000001100001100000 : 6240
    000001100001100000 : 6240
    000001100001100000 : 6240
    000001100001100000 : 6240
    000001000001101000 : 4200
    
    011000000000000010 *
    011000000000010010 : 98322
    011000000000001010 : 98314
    011000000000001010 : 98314
    011000000000001010 : 98314
    011000000000001010 : 98314
    011000000000001010 : 98314
    
    000000001100000010 *
    010000001100000010 : 66306
    000000001100010010 : 786
    000000001100001010 : 778
    
    000000000000011100 *
    100000000000011100 : 1311000
    100000000000011100 : 1311000
    
    000000000001101000 *
    000000010001101000 : 1128
    000000000001101100 : 108
    
    000100000000001001 *
    010100000000001001 : 81929
    010100000000001001 : 81929
    
    000100000000001010 *
    000110000000001010 : 24586
    000100000000011010 : 16410
    
    011001000000000000 *
    111001000000000000 : 233472
    011001100000000000 : 104448
    
    100100000000000010 *
    110100000000000010 : 212994
    100110000000000010 : 155650
    
    100101000000000000 *
    100111000000000000 : 159744
    100101100000000000 : 153600
    
    000000000000001110 *
    000000010000001110 : 1038
    
    000000100000001010 *
    000001100000001010 : 6154
    
    000001001100000000 *
    100001001100000000 : 135936
    
    000010000000001100 *
    000110000000001100 : 24588
    
    010001000000000010 *
    110001000000000010 : 200706
    
    
    *** remaining sets:
    000000000000000000
    000000000001100000
    Last edited by razzle; April 2nd, 2014 at 04:28 PM.

  13. #13
    Join Date
    May 2009
    Location
    Boston
    Posts
    254

    Re: question about sorting on bitstrings

    I was still having some problems because of the range-based for loops but I did a little looking around and was able to compile with,

    g++ -c -std=c++11 sort2_03.cpp

    It looks like the default for g++ is to compile under c++98, which is a bit odd. Does anyone know if that is correct?

    I suspect that the original code would compile with this as well, but thanks for making the revisions. I am now having a look at the output and will post back later.

    I promise I will address all of the other questions as well, I just need a bit of time to finish up some things.

    LMHmedchem

  14. #14
    Join Date
    Dec 2012
    Location
    England
    Posts
    2,329

    Re: question about sorting on bitstrings

    It seems integers longer than 5 digits have been truncated in your output. Here's the correct output.
    Yeah, you're right. Sorry. I mustn't have set the capture width wide enough when I was copying the output from my console window.
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

  15. #15
    Join Date
    May 2009
    Location
    Boston
    Posts
    254

    Re: question about sorting on bitstrings

    Well I was able to run this set of data and I pulled out all of the strings that were found to have common 3 bit subsets. Then I dropped the subset size to 2 and run the strings that had no common 3 bit sub sets. I think that this worked well and I like the order that resulted. I then did the analogous analysis looking at strings with 5 on bits. I first looked first for 4 bit common sub sets, and then for 3 bit common sub sets on strings that did not have 4 bits in common. So far I have processed up to strings with 6 on bits.

    In this data, I have strings with as many as 12 on bits. Those are not important because there are only a few of them and they will come at the end regardless. I know that will all possible sub sets approaches, the possibilities can blow up numerically. I am wondering if that could happen here as the sub sets size increases. I think that the binomial coefficient will maximize where half the bits are on. I think there would be ~50,000 possible true cases for that but I don't quite remember the formula. 50,000 is not a particularly large number for a computer, but I am still wondering what would happen for that case. I am sure there would be some performance loss, but that is not relevant here. For me, as long as a program can finish in less than the time it takes to get coffee, it's a fast program. In other words, the program is going to be waiting on me more the me waiting on the program.

    I think that this code will work and I can functionalize it to call with a sub set size.

    Quote Originally Posted by OReubens View Post
    This looks like it's a variant of a fitting test...

    what happens if you find a long run, but some of the items in that run then also appear in another run (with 4 on bits, and a test for 3, every string will be part of 4 possible blocks). Does it's value count in all 4 or only in one of the 4

    Do you just want a greedy algorithm that finds the longest, then the longest of what remains, and the longest of what remains then...

    or do you want an algorithm that will fidn the best distribution or the one that will result in the fewest amount of blocks in total ?
    In short, I have no idea what the answer to that will be because I have no basis for determining an optimum. The reason I am doing this project is to search for the optimal pattern, or at least patterns that are consistently better than others. It is probably useful to give some context to the term "better". I am working on an AI project where the learning algorithm processes on input pattern at a time. There is no clear information at present concerning the optimal order to present the patterns, so I am working on that. It is clear that there are some orders that do not work well in this case (random, permuted), so I am looking at other methods. I have tried grouping the input patterns with hierarchical clustering and that is an improvement over random. It is clear that it is helpful for the patterns to be presented in groups that are similar.

    I have begun looking at a presentation order that is theoretically simple > complex. Using the bit strings, the simplest patterns have only 1 on bit. More on bits means more complex. Within that generalization, I think that it will be useful to have the longest possible runs where the features of the patterns change very little. In the bit strings I am using, each on bit represents a feature that is present. In reality, that feature is described with a real number between 0.1-0.9. Just because two patterns have the same on bits, the will not likely have the same real number value for the features those bits describe.

    One presentation method would be to present all the patterns that have the same on bits, starting with the largest group and ending with singles without 4 matching bits. Something like,

    Code:
    010000000000001110
    010000000000001110
    010000000000001110
    010000000000001110
    010000000000001110
    
    010000001110000000
    010000001110000000
    010000001110000000
    
    000000001110010000
    000000001110010000
    
    110000000000001100
    
    010000000000011100
    
    100000001110000000
    This results in some runs of several identical patterns, but also in some orphan patterns that are presented singly and may, or may not have anything in common with the previous pattern.

    If groupings of 3 out of 4 matching bits are used for the same patterns,
    Code:
    010000000000001110
    010000000000001110
    010000000000001110
    010000000000001110
    010000000000001110
    110000000000001100
    010000000000011100
    
    010000001110000000
    010000001110000000
    010000001110000000
    000000001110010000
    000000001110010000
    100000001110000000
    you end up with longer runs and no orphans. Within each block of 3 matching bits, blocks of identical patterns are still kept together and the sort order places the largest identical block first, followed by the second largest, etc. Cases where all 4 bits are identical will still occur in blocks, but these blocks will be built on be adding other patterns with 3 out of the 4 bits. At wost, there will be only one novel feature as the learning algorithm progresses from pattern to pattern. That is the overall philosophy, to never have more than one feature that is different between sequential patterns. That is not possible with real data, so I am trying to get reasonably close.

    I have no idea how this will turn out an I may have to try a number of things before I can convince myself that one method gives consistently better results. There are thousands of patterns in the data, so it is necessary to put some thought into algorithms that can be used to sort the data with some reasonable automation. I have done this kind of thing by hand with some data sets and it is as tedious and error prone as you would expect. The bigger problem is that there is no way to guarantee that you have the the order you think you do when the patterns get complex.

    In some way this resembles the Tanimoto and Dice coefficients that can be used to compare bits in common and bits in difference. I don't really know much about fitting tests or coverage algorithms.

    I just re-read some of the thread an noticed that razzle posted that the 9 out of 18 binominal is 48,620, so I missed that. I am still not overly worried about the time because this data set only has 8 patterns with 9 on bits. As long as there is no memory allocation issue, then I think it will be fine. I will test that tomorrow.

    Making this code into a function that will sort my entire data sheet will be another challenge. I don't think I have any sort functions in my column class. I guess it is time to add some.

    LMHmedchem
    Last edited by LMHmedchem; April 3rd, 2014 at 12:08 AM.

Page 1 of 2 12 LastLast

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  


Azure Activities Information Page

Windows Mobile Development Center


Click Here to Expand Forum to Full Width

This is a CodeGuru survey question.


Featured


HTML5 Development Center