• April 1st, 2014, 01:20 PM
LMHmedchem
Hello,

I have a problem I am working on where I need to sort some data based on the values of a string of bits. The strings look like this,

010000001110000000

There are 18 bits; 1 means a feature is present, 0 means the feature is absent.

Each of these strings has 4 on bits. I need to sort them such that I have the longest possible runs with 3 of the same on bits. It doesn't matter which 3 bits are on; I am just looking to order them in blocks with the longest possible runs. As a second step, the ordered blocks will be sorted by size, largest first.

The following data is ordered like I need it to be.
Code:

```
// block 1, run of 12, keys 1,2,11 are identical (key 12 is also identical)
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
// block 2, run of 10, keys 1,2,15 are identical
011000000000001100
011000000000001100
011000000000001100
011000000000001100
011000000000001100
011000000000001100
011000000000001010
011000000000001010
011000000000001010
011000000000001010
// block 3, run of 8, keys 8,9,10 are identical
010000001110000000
010000001110000000
010000001110000000
000000001110010000
000000001110010000
000000001110000010
100000001110000000
000100001110000000
```
This is the sort order that I am looking for. I need to be able to take a list of the bit strings in any particular order and sort them into the order above. The algorithm would need to recognize that there are 4 on keys and then look for groupings of three common on keys.

This is more of an algorithm question than one about specific implementation in code. I generally assume that most programming problems have been solved one way or another, so I am looking for information about this kind of problem. I don't know much about analyzing and manipulating strings of bits, so suggestions would be helpful.
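The core comparison can be sketched cheaply with bit operations: parse each string into an integer, then count shared on bits with a bitwise AND plus a popcount. A minimal sketch (the function names here are my own, not from any library):

```cpp
#include <bitset>
#include <cassert>
#include <string>

const int strsize = 18;  // length of each bit string

// Parse an 18-character '0'/'1' string into an integer bit pattern.
unsigned to_bits(const std::string& s) {
    return static_cast<unsigned>(std::bitset<strsize>(s).to_ulong());
}

// How many 'on' bits two strings share: the popcount of the bitwise AND.
// Two strings belong in the same block when this is >= 3.
int shared_on_bits(unsigned a, unsigned b) {
    return static_cast<int>(std::bitset<strsize>(a & b).count());
}
```

For example, `shared_on_bits` applied to `010000001110000000` and `000000001110010000` gives 3 (keys 8, 9, 10), so those two rows could sit in the same block.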

Is there a standard method for this kind of pattern recognition?

LMHmedchem
• April 1st, 2014, 01:35 PM
D_Drmmr
Re: question about sorting on bitstrings
Quote:

Originally Posted by LMHmedchem
Each of these strings has 4 on bits. I need to sort them such that I have the longest possible runs with 3 of the same on bits. It doesn't matter which 3 bits are on; I am just looking to order them in blocks with the longest possible runs. As a second step, the ordered blocks will be sorted by size, largest first.

That doesn't seem like an exact definition of the solution. You'll need to define formally what you mean by "longest possible runs". Is a solution with three runs of lengths 8, 5, 3 better or worse than 7, 6, 4?
• April 1st, 2014, 01:44 PM
LMHmedchem
Re: question about sorting on bitstrings
Yes, I am having trouble with the definition, partly because this is something I am still trying to figure out. I'm not exactly sure what the best outcome really is, so I have to start with a best guess. To answer your question, I would say that 8,5,3 would be better, but that is speculation that I will have to test.

My best guess is that the best method would be to find the longest single run of 3 and assign all of those strings to a block. That would be the run of 8 in the case you described. Then you would look only in the remaining strings for the longest run of 3 that you can find. I think that would also make sense from the programming point of view, because you could have a single function that looks for the longest run of a given length and call it as many times as needed to assign everything.

Does that make sense?

LMHmedchem
• April 1st, 2014, 06:18 PM
2kaud
Re: question about sorting on bitstrings
I'm not sure where you are getting your data from, but one possible way of doing this which may be of interest is
Code:

```
#include <iostream>
#include <string>
#include <cmath>
#include <vector>
using namespace std;

typedef unsigned int UINT;
typedef vector<UINT> vint;

struct sORDERED {
        UINT cnt;
        vint vdata;
        sORDERED() : cnt(0) {}
};

const UINT ordsize = 245760;
const UINT strsize = 18;

string data[] = {
        "011000000001100000", "011000000000001010", "011000000000001010",
        "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000000001100",
        "011000000000001100", "000000001110010000", "000000001110010000",
        "000000001110000010", "011000000000001100", "011000000000001100",
        "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000000001100", "011000000000001100", "011000000000001010",
        "011000000000001010", "010000001110000000", "010000001110000000",
        "010000001110000000", "100000001110000000", "000100001110000000"
};

const UINT comb[][3] = {{0, 1, 2}, {0, 1, 3}, {0, 2, 3}, {1, 2, 3}};

int main()
{
        const UINT NoData = sizeof(data) / sizeof(data[0]);
        const UINT NoComb = sizeof(comb) / (sizeof(comb[0][0]) * 3);

        sORDERED* order = new sORDERED[ordsize];
        bool* del = new bool[NoData]();  // () value-initializes all flags to false

        for (UINT d = 0; d < NoData; ++d) {
                UINT ones[4] = {0};
                const string& st = data[d];
                if (st.length() != strsize) {
                        cout << "Bad format " << st << endl;
                        continue;
                }
                for (UINT s = 0, o = 0; s < strsize; ++s)
                        if (st[s] == '1')
                                ones[o++] = strsize - s - 1;
                for (UINT o = 0; o < NoComb; ++o) {
                        UINT index = 0;
                        for (UINT c = 0; c < 3; ++c)
                                index += (UINT)pow(2.0, (int)ones[comb[o][c]]);
                        if (index >= ordsize) {
                                cout << "bad index! " << index << endl;
                        } else {
                                order[index].cnt++;
                                order[index].vdata.push_back(d);
                        }
                }
        }

        bool found;
        do {
                UINT max = 0;
                UINT mind = 0;
                found = false;
                for (int d = 0; d < ordsize; d++) {
                        if (order[d].cnt > 0) {
                                found = true;
                                if (order[d].cnt > max) {
                                        max = order[d].cnt;
                                        mind = d;
                                }
                        }
                }
                if (found) {
                        bool out = false;
                        for (UINT v = 0; v < order[mind].vdata.size(); ++v) {
                                if (!del[order[mind].vdata[v]]) {
                                        cout << data[order[mind].vdata[v]] << endl;
                                        out = true;
                                        del[order[mind].vdata[v]] = true;
                                }
                        }
                        order[mind].cnt = 0;
                        if (out)
                                cout << endl;
                }
        } while (found);

        delete [] order;
        delete [] del;
        return 0;
}
```
This produces the output
Code:

```
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000

011000000000001010
011000000000001010
011000000000001100
011000000000001100
011000000000001100
011000000000001100
011000000000001100
011000000000001100
011000000000001010
011000000000001010

000000001110010000
000000001110010000
000000001110000010
010000001110000000
010000001110000000
010000001110000000
100000001110000000
000100001110000000
```
which I think is what you're after. There may well be more efficient ways of doing this, but this is the 'quick and dirty' way that first occurred to me.
• April 1st, 2014, 11:14 PM
LMHmedchem
Re: question about sorting on bitstrings
Thanks,

I compiled the code you posted and ran it with some additional data. This is the code I compiled,
Code:

```
#include <iostream>
#include <string>
#include <cmath>
#include <vector>
using namespace std;

typedef unsigned int UINT;
typedef vector<UINT> vint;

struct sORDERED {
        UINT cnt;
        vint vdata;
        sORDERED() : cnt(0) {}
};

const UINT ordsize = 245760;
const UINT strsize = 18;

string data[] = {
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000000001100", "011000000000001100",
        "011000000000001100", "011000000000001100", "011000000000001100", "011000000000001100",
        "011000000000001100", "011000000000001010", "011000000000001010", "011000000000001010",
        "011000000000001010", "011000000000001010", "010000001110000000", "010000001110000000",
        "010000001110000000", "000000001110010000", "000000001110010000", "000000001110000010",
        "000000001110000010", "100000001110000000", "000100001110000000", "000000001110001000",
        "000000011110000000", "000000001110000001", "000110000001100000", "000110000001100000",
        "000110000001100000", "000110000001100000", "000110000001100000", "000110000001100000",
        "000110000001100000", "000110000001100000", "010000000000001110", "010000000000001110",
        "010000000000001110", "010000000000001110", "010000000000001110", "000001100001100000",
        "000001100001100000", "000001100001100000", "000001100001100000", "000001100000001010",
        "010100000000001001", "010100000000001001", "010000000001100010", "010000000001100010",
        "100000000000011100", "100000000000011100", "011001100000000000", "111001000000000000",
        "110100000000000010", "110001000000000010", "011000000000010010", "010000001100000010",
        "110000000000001100", "010000000000011100", "100110000000000010", "000110000000001010",
        "000110000000001100", "100111000000000000", "100101100000000000", "100001001100000000",
        "000000010000001110", "000100000000011010", "000000001100001010", "000000001100010010",
        "000001000001101000", "100001000001100000", "000000010001101000", "000000000001101100"
};

const UINT comb[][3] = {{0, 1, 2}, {0, 1, 3}, {0, 2, 3}, {1, 2, 3}};

int main()
{
        const UINT NoData = sizeof(data) / sizeof(data[0]);
        const UINT NoComb = sizeof(comb) / (sizeof(comb[0][0]) * 3);

        sORDERED* order = new sORDERED[ordsize];
        bool* del = new bool[NoData]();  // () value-initializes all flags to false

        for (UINT d = 0; d < NoData; ++d) {
                UINT ones[4] = {0};
                const string& st = data[d];
                if (st.length() != strsize) {
                        cout << "Bad format " << st << endl;
                        continue;
                }
                for (UINT s = 0, o = 0; s < strsize; ++s)
                        if (st[s] == '1')
                                ones[o++] = strsize - s - 1;
                for (UINT o = 0; o < NoComb; ++o) {
                        UINT index = 0;
                        for (UINT c = 0; c < 3; ++c)
                                index += (UINT)pow(2.0, (int)ones[comb[o][c]]);
                        if (index >= ordsize) {
                                cout << "bad index! " << index << endl;
                        } else {
                                order[index].cnt++;
                                order[index].vdata.push_back(d);
                        }
                }
        }

        bool found;
        do {
                UINT max = 0;
                UINT mind = 0;
                found = false;
                for (int d = 0; d < ordsize; d++) {
                        if (order[d].cnt > 0) {
                                found = true;
                                if (order[d].cnt > max) {
                                        max = order[d].cnt;
                                        mind = d;
                                }
                        }
                }
                if (found) {
                        bool out = false;
                        for (UINT v = 0; v < order[mind].vdata.size(); ++v) {
                                if (!del[order[mind].vdata[v]]) {
                                        cout << data[order[mind].vdata[v]] << endl;
                                        out = true;
                                        del[order[mind].vdata[v]] = true;
                                }
                        }
                        order[mind].cnt = 0;
                        if (out)
                                cout << endl;
                }
        } while (found);

        delete [] order;
        delete [] del;
        return 0;
}
```
This seems to work with a few caveats. This is the annotated output
Code:

```
// run of 28, 1,11,12
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
011000000001100000
010000000001100010
010000000001100010
// run of 14, 1,14,15
011000000000001100
011000000000001100
011000000000001100
011000000000001100
011000000000001100
011000000000001100
011000000000001100
010000000000001110
010000000000001110
010000000000001110
010000000000001110
010000000000001110
110000000000001100
010000000000011100
// run of 12, 8,9,10
010000001110000000
010000001110000000
010000001110000000
000000001110010000
000000001110010000
000000001110000010
000000001110000010
100000001110000000
000100001110000000
000000001110001000
000000011110000000
000000001110000001
// run of 5, 1,2,14
011000000000001010
011000000000001010
011000000000001010
011000000000001010
011000000000001010
// run of 8, 3,4,11
000110000001100000
000110000001100000
000110000001100000
000110000001100000
000110000001100000
000110000001100000
000110000001100000
000110000001100000
// single
000000010000001110
// run of 6, 5,11,12
000001100001100000
000001100001100000
000001100001100000
000001100001100000
000001000001101000
100001000001100000
// single
011000000000010010
// run of 3, 8,9,16
010000001100000010
000000001100001010
000000001100010010
// run of 2, 0,13,14
100000000000011100
100000000000011100
// run of 2, 11,12,14
000000010001101000
000000000001101100
// run of 2, 1,3,14
010100000000001001
010100000000001001
// run of 2, 3,14,16
000110000000001010
000100000000011010
// single
100110000000000010
// single
000110000000001100
// run of 2, 1,2,5
011001100000000000
111001000000000000
// single
100001001100000000
// single
110100000000000010
// run of 2, 0,3,5
100111000000000000
100101100000000000
// single
110001000000000010
// single
000001100000001010
```
The first three blocks are output in order of size with the largest first. This is what I need. There does not appear to be any size order after that. It will take me a bit to work through the code, but I can probably figure that part out.

The other thing I will need to do is to look for sets of different sizes. For example, there are several strings in this data that simply do not have 3 bits in common with any of the others. In that case, I would want to repeat the process looking for 2 common bits. I will also have strings with 5 on bits where I want to find runs of 4 matching bits. I think that the simplest thing would be to have a general algorithm where the size of the common bit groupings would be a variable. If that were the case, I could use the same code to look for any size subsets.
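The generalization described above (making the size of the common-bit group a variable) can be sketched as a single function parametrized on k. `best_k_subset` is a hypothetical helper, not code from this thread; it enumerates every k-bit mask with Gosper's hack and counts how many inputs contain each one:

```cpp
#include <bitset>
#include <cassert>
#include <map>
#include <string>
#include <vector>

const int strsize = 18;

// Parse a '0'/'1' string into an integer bit pattern.
unsigned to_bits(const std::string& s) {
    return static_cast<unsigned>(std::bitset<strsize>(s).to_ulong());
}

// Hypothetical helper: return the k-bit mask that is contained in the
// largest number of input bit patterns, i.e. the biggest possible run.
// k is the variable group size: 3 for 4-bit strings, 4 for 5-bit ones, etc.
unsigned best_k_subset(const std::vector<unsigned>& sets, int k) {
    std::map<unsigned, int> counts;
    // enumerate every k-subset of the 18 positions (Gosper's hack)
    for (unsigned m = (1u << k) - 1; m < (1u << strsize); ) {
        for (unsigned s : sets)
            if ((s & m) == m) ++counts[m];  // this input contains the mask
        unsigned lo = m & -m;               // lowest set bit
        unsigned r = m + lo;                // ripple it upward
        m = r | (((m ^ r) >> 2) / lo);      // next k-bit mask
    }
    unsigned best = 0;
    int bestCount = 0;
    for (const auto& kv : counts)
        if (kv.second > bestCount) { bestCount = kv.second; best = kv.first; }
    return best;
}
```

Calling this repeatedly (remove the matched strings, call again) would give the greedy block assignment for any group size without duplicating code per k.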

Does that make any sense? Let me know.

Also, looking through the singles that were found, there are at least 2 that share 3 common on bits and should have been grouped together,

110100000000000010
110001000000000010

To answer your question about the source of the data: this is spreadsheet-like tab-delimited data, and there will be entire rows that need to be sorted along with the bit strings. I usually use a columns class for manipulating data like that, so I will have to shoehorn this into a larger program that reads in a text file, picks out the bitkey column, sorts the data, and then writes the sorted output to a file. I will start putting that together tomorrow since the IO stuff is pretty boilerplate.

Thanks again for the input.

LMHmedchem
• April 2nd, 2014, 04:35 AM
D_Drmmr
Re: question about sorting on bitstrings
Quote:

Originally Posted by LMHmedchem
Yes, I am having trouble with the definition, partly because this is something I am still trying to figure out. I'm not exactly sure what the best outcome really is, so I have to start with a best guess. To answer your question, I would say that 8,5,3 would be better, but that is speculation that I will have to test.

My best guess is that the best method would be to find the longest single run of 3 and assign all of those strings to a block. That would be the run of 8 in the case you described. Then you would look only in the remaining strings for the longest run of 3 that you can find. I think that would also make sense from the programming point of view, because you could have a single function that looks for the longest run of a given length and call it as many times as needed to assign everything.

Does that make sense?

Yes, it's a possible way to tackle the problem. You can model the problem as a longest path problem by representing each (unique) input string as a node in the graph and adding edges between any pair of inputs that share 3 'on' bits.

An alternative approach would be to try to minimize the total number of runs. But it would really depend on the context whether that makes sense or not.
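The graph formulation can be sketched directly: one node per (unique) input pattern, with an edge wherever two patterns share at least 3 on bits. This is only the graph construction, not a longest-path solver; the helper names are mine:

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const int strsize = 18;

unsigned to_bits(const std::string& s) {
    return static_cast<unsigned>(std::bitset<strsize>(s).to_ulong());
}

// Adjacency list: nodes i and j are connected iff their bit patterns
// share at least 3 'on' bits (popcount of the bitwise AND).
std::vector<std::vector<int>> build_graph(const std::vector<unsigned>& nodes) {
    std::vector<std::vector<int>> adj(nodes.size());
    for (std::size_t i = 0; i < nodes.size(); ++i)
        for (std::size_t j = i + 1; j < nodes.size(); ++j)
            if (std::bitset<strsize>(nodes[i] & nodes[j]).count() >= 3) {
                adj[i].push_back(static_cast<int>(j));
                adj[j].push_back(static_cast<int>(i));
            }
    return adj;
}
```

Worth noting: finding a longest path in a general graph is NP-hard, so in practice a greedy ordering over this graph (or over the subset counts) is the realistic route rather than an exact solver.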
• April 2nd, 2014, 06:13 AM
2kaud
Re: question about sorting on bitstrings
Quote:

Also, looking through the singles that were found, there are at least 2,

110100000000000010
110001000000000010

Changing
Code:

`for (int d = 0; d < ordsize; d++) {`
to
Code:

`for (int d = ordsize - 1; d >= 0; d--) {`
should correct that.
• April 2nd, 2014, 07:16 AM
OReubens
Re: question about sorting on bitstrings
This looks like it's a variant of a fitting test...

what happens if you find a long run, but some of the items in that run then also appear in another run (with 4 on bits and a test for 3, every string is part of 4 possible blocks)? Does its value count in all 4 or only in one of the 4?

Do you just want a greedy algorithm that finds the longest run, then the longest of what remains, then the longest of what remains after that...

or do you want an algorithm that will find the best distribution, or the one that will result in the fewest number of blocks in total?
• April 2nd, 2014, 10:52 AM
razzle
Re: question about sorting on bitstrings
Quote:

Originally Posted by LMHmedchem
It doesn't matter which 3 bits are on; I am just looking to order them in blocks with the longest possible runs. As a second step, the ordered blocks will be sorted by size, largest first.

The binomial coefficient "18 choose 3" tells how many different ways 3 bits can be set among 18 bits. It's 816.

So one solution is to generate all 816 3-bit combinations and count which of these are present in the largest number of 18-bit strings. That's the biggest possible run. Then the strings of this run are removed and everything is repeated again without them. That gives the second biggest run, etcetera. This is a greedy algorithm for what (although somewhat relaxed) basically is a set cover problem so it's only approximately optimal.

Generating all 3-bit subsets of an 18-bit set (or more generally all K sets of an N set) can be done using an interesting method called Gosper's hack. It works on ints interpreted as bitsets.
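Gosper's hack in isolation looks like this; given a pattern with k bits set, it produces the next larger pattern with exactly k bits set, so looping from the lowest k-bit mask enumerates all k-subsets:

```cpp
#include <cassert>

// Gosper's hack (HAKMEM item 175): given an integer x with exactly k bits
// set, return the next larger integer with exactly k bits set.
unsigned next_subset(unsigned x) {
    unsigned s = x & -x;               // lowest set bit of x
    unsigned r = x + s;                // ripple it into the run above it
    return r | (((x ^ r) >> 2) / s);   // re-pack the leftover bits at the bottom
}

// Count all 3-bit subsets of a 5-bit universe: start at 00111 and stop at
// the first value with a bit outside the universe.
int count_3_of_5() {
    int count = 0;
    for (unsigned m = 7; m < (1u << 5); m = next_subset(m)) ++count;
    return count;  // C(5,3) = 10
}
```

For example, `next_subset(7)` is 11 (00111 → 01011). Iterating the same way over 18 positions with k = 3 visits exactly the 816 masks mentioned above.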

This approach is feasible if the binomial coefficient isn't too big, and 816 is no problem. 9 bits among 18 gives a binomial of 48,620 and then it maybe starts to take time. That, on the other hand, is as bad as it gets for an 18-bit set; basically it's row 18 of Pascal's triangle.

I've coded a solution using VS 2013. The runs are generated and then sorted before printing. The strings are interpreted as integer bit patterns and sorted as integers from largest to smallest before output. But this is just an example and it can easily be changed to something else.

Code:

```
#include <iostream>
#include <algorithm>
#include <list>
#include <vector>
#include <string>
#include <bitset>

const int strsize = 18;

std::vector<std::string> data = {
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000000001100", "011000000000001100",
        "011000000000001100", "011000000000001100", "011000000000001100", "011000000000001100",
        "011000000000001100", "011000000000001010", "011000000000001010", "011000000000001010",
        "011000000000001010", "011000000000001010", "010000001110000000", "010000001110000000",
        "010000001110000000", "000000001110010000", "000000001110010000", "000000001110000010",
        "000000001110000010", "100000001110000000", "000100001110000000", "000000001110001000",
        "000000011110000000", "000000001110000001", "000110000001100000", "000110000001100000",
        "000110000001100000", "000110000001100000", "000110000001100000", "000110000001100000",
        "000110000001100000", "000110000001100000", "010000000000001110", "010000000000001110",
        "010000000000001110", "010000000000001110", "010000000000001110", "000001100001100000",
        "000001100001100000", "000001100001100000", "000001100001100000", "000001100000001010",
        "010100000000001001", "010100000000001001", "010000000001100010", "010000000001100010",
        "100000000000011100", "100000000000011100", "011001100000000000", "111001000000000000",
        "110100000000000010", "110001000000000010", "011000000000010010", "010000001100000010",
        "110000000000001100", "010000000000011100", "100110000000000010", "000110000000001010",
        "000110000000001100", "100111000000000000", "100101100000000000", "100001001100000000",
        "000000010000001110", "000100000000011010", "000000001100001010", "000000001100010010",
        "000001000001101000", "100001000001100000", "000000010001101000", "000000000001101100",
        "000000000000000000", "000000000001100000"
};

inline int next(int x) { // next K subset of an N bitset (gosper's hack (hakmem 175))
        int s = x & -x;
        int r = x + s;
        return r | (((x^r)>>2)/s);
}

inline int start(int k) { // first subset
        return (1<<k) - 1;
}

inline int limit(int n) { // first non-subset
        return 1<<n;
}

int to_set(const std::string& s) { // string to bitset
        return int(std::bitset<strsize>(s).to_ulong());
}

std::string to_string(int x) { // bitset to string
        return std::bitset<strsize>(x).to_string();
}

void run() {
        const int N = strsize; // size of user sets (string length)
        const int K = 3; // size of subset

        std::list<int> usersets; // all user sets converted to bitsets
        for (std::string s : data) usersets.push_back(to_set(s));

        std::vector<int> counts; // one counter per subset
        int b = 0; // number of K subsets of an N set (the N over K binomial)
        for (int s=start(K); s<limit(N); s=next(s)) ++b;
        std::cout << "*** binominal = " << b << std::endl;
        counts.resize(b);

        bool quit = false;
        do {
                for (int& c : counts) c = 0; // reset counters

                // count all user sets for every subset
                for (int s=start(K), j=0; s<limit(N); s=next(s), ++j) {
                        for (int set : usersets) {
                                if ((set & s) == s) ++counts[j];
                        }
                }

                // find max counter (biggest run)
                int runSet = 0;
                for (int s=start(K), j=0, m=0; s<limit(N); s=next(s), ++j) {
                        if (counts[j] > m) {
                                m = counts[j];
                                runSet = s;
                        }
                }

                // print all user sets belonging to this run
                if (runSet > 0) {
                        // gather output sets
                        std::vector<int> out;
                        for (auto sp = usersets.begin(); sp != usersets.end();) {
                                if ((*sp & runSet) == runSet) {
                                        out.push_back(*sp);
                                        sp = usersets.erase(sp); // remove printed user set
                                } else ++sp;
                        }
                        // actual output
                        std::sort(out.rbegin(), out.rend()); // from largest to smallest
                        std::cout << to_string(runSet) << " *" << std::endl;
                        for (int x : out) {
                                std::cout << to_string(x) << " : " << x << std::endl;
                        }
                } else quit = true; // no more runs

                std::cout << std::endl;
        } while (!quit);

        // print all remaining user sets
        if (usersets.size() > 0) {
                std::cout << "*** remaining sets:" << std::endl;
                for (int x : usersets) std::cout << to_string(x) << std::endl;
        }
}
```
• April 2nd, 2014, 12:12 PM
LMHmedchem
Re: question about sorting on bitstrings
Thanks to everyone for all the thoughtful replies so far. I am compiling the code posted by razzle and having some issues. I don't have VS, so I am compiling with GCC. I don't think that the first issue I am running into has anything to do with that.

Very simply, the following code will not compile,
Code:

```
#include <iostream>
#include <string>
#include <vector>
using namespace std;

std::vector<std::string> data = {
        "011000000001100000",
        "011000000001100000"
};
```
I am getting an error,
error: ‘data’ does not name a type

I know that not every compiler supports C++11 initializer lists, so I have tried separating the declaration from the assignment and tried to assign by some different methods, like .assign and .push_back.
Code:

```
#include <iostream>
#include <string>
#include <vector>
using namespace std;

std::vector<std::string> data;
data[] = { "011000000001100000", "011000000001100000" };
// or
data.assign( "011000000001100000", "011000000001100000" );
// or
data.push_back("011000000001100000");
data.push_back("011000000001100000");
```
None of that makes a difference; the compiler doesn't seem to be recognizing the declaration of data as a vector of strings.

I always seem to get stuck on the simple things.

LMHmedchem
• April 2nd, 2014, 02:57 PM
2kaud
Re: question about sorting on bitstrings
Based upon Razzle's code from his post #9. This will now compile with MSVC 2012 so hopefully it will compile and run with your compiler.
Code:

```
#include <cstdlib>
#include <list>
#include <vector>
#include <string>
#include <bitset>
#include <iostream>
#include <algorithm>

const int strsize = 18;

//std::vector<std::string> data = {
std::string data[] = {
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000001100000", "011000000001100000",
        "011000000001100000", "011000000001100000", "011000000000001100", "011000000000001100",
        "011000000000001100", "011000000000001100", "011000000000001100", "011000000000001100",
        "011000000000001100", "011000000000001010", "011000000000001010", "011000000000001010",
        "011000000000001010", "011000000000001010", "010000001110000000", "010000001110000000",
        "010000001110000000", "000000001110010000", "000000001110010000", "000000001110000010",
        "000000001110000010", "100000001110000000", "000100001110000000", "000000001110001000",
        "000000011110000000", "000000001110000001", "000110000001100000", "000110000001100000",
        "000110000001100000", "000110000001100000", "000110000001100000", "000110000001100000",
        "000110000001100000", "000110000001100000", "010000000000001110", "010000000000001110",
        "010000000000001110", "010000000000001110", "010000000000001110", "000001100001100000",
        "000001100001100000", "000001100001100000", "000001100001100000", "000001100000001010",
        "010100000000001001", "010100000000001001", "010000000001100010", "010000000001100010",
        "100000000000011100", "100000000000011100", "011001100000000000", "111001000000000000",
        "110100000000000010", "110001000000000010", "011000000000010010", "010000001100000010",
        "110000000000001100", "010000000000011100", "100110000000000010", "000110000000001010",
        "000110000000001100", "100111000000000000", "100101100000000000", "100001001100000000",
        "000000010000001110", "000100000000011010", "000000001100001010", "000000001100010010",
        "000001000001101000", "100001000001100000", "000000010001101000", "000000000001101100",
        "000000000000000000", "000000000001100000"
};

inline int next(int x) { // next K subset of an N bitset (gosper's hack (hakmem 175))
        int s = x & -x;
        int r = x + s;
        return r | (((x^r)>>2)/s);
}

inline int start(int k) { // first subset
        return (1<<k) - 1;
}

inline int limit(int n) { // first non-subset
        return 1<<n;
}

int to_set(const std::string& s) { // string to bitset
        return int(std::bitset<strsize>(s).to_ulong());
}

std::string to_string(int x) { // bitset to string
        return std::bitset<strsize>(x).to_string();
}

int main() {
        const int N = strsize; // size of user sets (string length)
        const int K = 3; // size of subset

        std::list<int> usersets; // all user sets converted to bitsets
        for (std::string s : data) usersets.push_back(to_set(s));

        std::vector<int> counts; // one counter per subset
        int b = 0; // number of K subsets of an N set (the N over K binomial)
        for (int s=start(K); s<limit(N); s=next(s)) ++b;
        std::cout << "*** binominal = " << b << std::endl;
        counts.resize(b);

        bool quit = false;
        do {
                for (int& c : counts) c = 0; // reset counters

                // count all user sets for every subset
                for (int s=start(K), j=0; s<limit(N); s=next(s), ++j) {
                        for (int set : usersets) {
                                if ((set & s) == s) ++counts[j];
                        }
                }

                // find max counter (biggest run)
                int runSet = 0;
                for (int s=start(K), j=0, m=0; s<limit(N); s=next(s), ++j) {
                        if (counts[j] > m) {
                                m = counts[j];
                                runSet = s;
                        }
                }

                // print all user sets belonging to this run
                if (runSet > 0) {
                        // gather output sets
                        std::vector<int> out;
                        for (auto sp = usersets.begin(); sp != usersets.end();) {
                                if ((*sp & runSet) == runSet) {
                                        out.push_back(*sp);
                                        sp = usersets.erase(sp); // remove printed user set
                                } else ++sp;
                        }
                        // actual output
                        std::sort(out.rbegin(), out.rend()); // from largest to smallest
                        std::cout << to_string(runSet) << " *" << std::endl;
                        for (int x : out) {
                                std::cout << to_string(x) << " : " << x << std::endl;
                        }
                } else quit = true; // no more runs

                std::cout << std::endl;
        } while (!quit);

        // print all remaining user sets
        if (usersets.size() > 0) {
                std::cout << "*** remaining sets:" << std::endl;
                for (int x : usersets) std::cout << to_string(x) << std::endl;
        }
}
```
producing the output
Code:

```*** binominal = 816 010000000001100000 * 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 010000000001100010 : 65634 010000000001100010 : 65634 010000000000001100 * 110000000000001100 : 196620 011000000000001100 : 98316 011000000000001100 : 98316 011000000000001100 : 98316 011000000000001100 : 98316 011000000000001100 : 98316 011000000000001100 : 98316 011000000000001100 : 98316 010000000000011100 : 65564 010000000000001110 : 65550 010000000000001110 : 65550 010000000000001110 : 65550 010000000000001110 : 65550 010000000000001110 : 65550 000000001110000000 * 100000001110000000 : 131968 010000001110000000 : 66432 010000001110000000 : 66432 010000001110000000 : 66432 000100001110000000 : 17280 000000011110000000 : 1920 000000001110010000 : 912 000000001110010000 : 912 000000001110001000 : 904 000000001110000010 : 898 000000001110000010 : 898 000000001110000001 : 897 000010000001100000 * 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000001000001100000 * 100001000001100000 : 135264 000001100001100000 : 6240 000001100001100000 : 6240 000001100001100000 : 6240 000001100001100000 : 6240 000001000001101000 : 4200 011000000000000010 * 011000000000010010 : 98322 
011000000000001010 : 98314 011000000000001010 : 98314 011000000000001010 : 98314 011000000000001010 : 98314 011000000000001010 : 98314 000000001100000010 * 010000001100000010 : 66306 000000001100010010 : 786 000000001100001010 : 778 000000000000011100 * 100000000000011100 : 1311000 100000000000011100 : 1311000 000000000001101000 * 000000010001101000 : 1128 000000000001101100 : 108 000100000000001001 * 010100000000001001 : 81929 010100000000001001 : 81929 000100000000001010 * 000110000000001010 : 24586 000100000000011010 : 16410 011001000000000000 * 111001000000000000 : 233472 011001100000000000 : 104448 100100000000000010 * 110100000000000010 : 212994 100110000000000010 : 155650 100101000000000000 * 100111000000000000 : 159744 100101100000000000 : 153600 000000000000001110 * 000000010000001110 : 1038 000000100000001010 * 000001100000001010 : 6154 000001001100000000 * 100001001100000000 : 135936 000010000000001100 * 000110000000001100 : 24588 010001000000000010 * 110001000000000010 : 200706 *** remaining sets: 000000000000000000 000000000001100000```
• April 2nd, 2014, 04:04 PM
LMHmedchem
Re: question about sorting on bitstrings
I was still having some problems because of the range-based for loops, but after a little looking around I was able to compile with,

g++ -c -std=c++11 sort2_03.cpp

It looks like the default for g++ is to compile under C++98, which is a bit odd. Does anyone know if that is correct?

I suspect that the original code would compile with this as well, but thanks for making the revisions. I am now having a look at the output and will post back later.

I promise I will address all of the other questions as well, I just need a bit of time to finish up some things.

LMHmedchem
• April 2nd, 2014, 04:25 PM
razzle
Re: question about sorting on bitstrings
Quote:

Originally Posted by 2kaud
producing the output

Thank you for rewriting it in VS 2012.

The string ending with a * is the 3-bit subset that served as the template for the strings collected in each run.

The numbers after : are the strings interpreted as integers. I printed them since the output is sorted based on them. It seems integers longer than 5 digits have been truncated in your output. Here's the correct output.

Code:

```*** binominal = 816 010000000001100000 * 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 011000000001100000 : 98400 010000000001100010 : 65634 010000000001100010 : 65634 010000000000001100 * 110000000000001100 : 196620 011000000000001100 : 98316 011000000000001100 : 98316 011000000000001100 : 98316 011000000000001100 : 98316 011000000000001100 : 98316 011000000000001100 : 98316 011000000000001100 : 98316 010000000000011100 : 65564 010000000000001110 : 65550 010000000000001110 : 65550 010000000000001110 : 65550 010000000000001110 : 65550 010000000000001110 : 65550 000000001110000000 * 100000001110000000 : 131968 010000001110000000 : 66432 010000001110000000 : 66432 010000001110000000 : 66432 000100001110000000 : 17280 000000011110000000 : 1920 000000001110010000 : 912 000000001110010000 : 912 000000001110001000 : 904 000000001110000010 : 898 000000001110000010 : 898 000000001110000001 : 897 000010000001100000 * 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000110000001100000 : 24672 000001000001100000 * 100001000001100000 : 135264 000001100001100000 : 6240 000001100001100000 : 6240 000001100001100000 : 6240 000001100001100000 : 6240 000001000001101000 : 4200 011000000000000010 * 011000000000010010 : 98322 
011000000000001010 : 98314 011000000000001010 : 98314 011000000000001010 : 98314 011000000000001010 : 98314 011000000000001010 : 98314 000000001100000010 * 010000001100000010 : 66306 000000001100010010 : 786 000000001100001010 : 778 000000000000011100 * 100000000000011100 : 131100 100000000000011100 : 131100 000000000001101000 * 000000010001101000 : 1128 000000000001101100 : 108 000100000000001001 * 010100000000001001 : 81929 010100000000001001 : 81929 000100000000001010 * 000110000000001010 : 24586 000100000000011010 : 16410 011001000000000000 * 111001000000000000 : 233472 011001100000000000 : 104448 100100000000000010 * 110100000000000010 : 212994 100110000000000010 : 155650 100101000000000000 * 100111000000000000 : 159744 100101100000000000 : 153600 000000000000001110 * 000000010000001110 : 1038 000000100000001010 * 000001100000001010 : 6154 000001001100000000 * 100001001100000000 : 135936 000010000000001100 * 000110000000001100 : 24588 010001000000000010 * 110001000000000010 : 200706 *** remaining sets: 000000000000000000 000000000001100000```
• April 2nd, 2014, 04:39 PM
2kaud
Re: question about sorting on bitstrings
Quote:

It seems integers longer than 5 digits have been truncated in your output. Here's the correct output.
Yeah, you're right. Sorry. :) I mustn't have set the capture width wide enough when I was copying the output from my console window. :eek:
• April 2nd, 2014, 11:56 PM
LMHmedchem
Re: question about sorting on bitstrings
Well, I was able to run this set of data and I pulled out all of the strings that were found to have common 3-bit subsets. Then I dropped the subset size to 2 and ran the strings that had no common 3-bit subsets. I think that this worked well and I like the order that resulted. I then did the analogous analysis on strings with 5 on bits: I first looked for 4-bit common subsets, and then for 3-bit common subsets among the strings that did not have 4 bits in common. So far I have processed up to strings with 6 on bits.

In this data, I have strings with as many as 12 on bits. Those are not important because there are only a few of them and they will come at the end regardless. I know that with all-possible-subsets approaches the possibilities can blow up combinatorially, and I am wondering if that could happen here as the subset size increases. I think the binomial coefficient is maximized when half the bits are on; there would be roughly 50,000 possible subsets in that case, though I don't quite remember the formula. 50,000 is not a particularly large number for a computer, but I am still wondering what would happen in that case. I am sure there would be some performance loss, but that is not relevant here. For me, as long as a program can finish in less than the time it takes to get coffee, it's a fast program. In other words, the program is going to be waiting on me more than me waiting on the program.

I think that this code will work, and I can turn it into a function that takes a subset size.

Quote:

Originally Posted by OReubens
This looks like it's a variant of a fitting test...

what happens if you find a long run, but some of the items in that run then also appear in another run (with 4 on bits and a test for 3, every string will be part of 4 possible blocks). Does its value count in all 4 or only in one of the 4?

Do you just want a greedy algorithm that finds the longest run, then the longest of what remains, then the longest of what remains after that...

or do you want an algorithm that will find the best distribution, or the one that will result in the fewest number of blocks in total?

In short, I have no idea what the answer to that will be because I have no basis for determining an optimum. The reason I am doing this project is to search for the optimal pattern order, or at least orders that are consistently better than others. It is probably useful to give some context to the term "better". I am working on an AI project where the learning algorithm processes one input pattern at a time. There is no clear information at present concerning the optimal order in which to present the patterns, so I am working on that. It is clear that there are some orders that do not work well in this case (random, permuted), so I am looking at other methods. I have tried grouping the input patterns with hierarchical clustering, and that is an improvement over random. It is clear that it is helpful for the patterns to be presented in groups that are similar.

I have begun looking at a presentation order that is, in theory, simple > complex. Using the bit strings, the simplest patterns have only 1 on bit; more on bits means more complex. Within that generalization, I think it will be useful to have the longest possible runs where the features of the patterns change very little. In the bit strings I am using, each on bit represents a feature that is present. In reality, that feature is described with a real number between 0.1 and 0.9, so even when two patterns have the same on bits, they will not likely have the same real-number values for the features those bits describe.

One presentation method would be to present all the patterns that have the same on bits, starting with the largest group and ending with the singles that don't share 4 matching bits with anything else. Something like,

Code:

```
010000000000001110
010000000000001110
010000000000001110
010000000000001110
010000000000001110
010000001110000000
010000001110000000
010000001110000000
000000001110010000
000000001110010000
110000000000001100
010000000000011100
100000001110000000
```
This results in some runs of several identical patterns, but also in some orphan patterns that are presented singly and may or may not have anything in common with the previous pattern.

If groupings of 3 out of 4 matching bits are used for the same patterns,
Code:

```
010000000000001110
010000000000001110
010000000000001110
010000000000001110
010000000000001110
110000000000001100
010000000000011100
010000001110000000
010000001110000000
010000001110000000
000000001110010000
000000001110010000
100000001110000000
```
you end up with longer runs and no orphans. Within each block of 3 matching bits, blocks of identical patterns are still kept together, and the sort order places the largest identical block first, followed by the second largest, etc. Cases where all 4 bits are identical will still occur in blocks, but these blocks will be built on by adding other patterns that match 3 of the 4 bits. At worst, there will be only one novel feature as the learning algorithm progresses from pattern to pattern. That is the overall philosophy: never have more than one feature differ between sequential patterns. That is not possible with real data, so I am trying to get reasonably close.

I have no idea how this will turn out, and I may have to try a number of things before I can convince myself that one method gives consistently better results. There are thousands of patterns in the data, so it is necessary to put some thought into algorithms that can sort the data with some reasonable automation. I have done this kind of thing by hand with some data sets, and it is as tedious and error prone as you would expect. The bigger problem is that there is no way to guarantee you have the order you think you do once the patterns get complex.

In some ways this resembles the Tanimoto and Dice coefficients, which can be used to compare bits in common and bits in difference. I don't really know much about fitting tests or coverage algorithms.

I just re-read some of the thread and noticed that razzle posted that the 9-out-of-18 binomial is 48,620, so I missed that. I am still not overly worried about the time because this data set only has 8 patterns with 9 on bits. As long as there is no memory allocation issue, I think it will be fine. I will test that tomorrow.

Making this code into a function that will sort my entire data sheet will be another challenge. I don't think I have any sort functions in my column class. I guess it is time to add some.

LMHmedchem