  1. #16
    LMHmedchem

    Re: a variation on all possible subsets

    Quote Originally Posted by 2kaud View Post
    I don't suppose it's going to be of much help to suggest you update the compiler to a modern one...
    In this particular case, no. I have plenty of versions of g++, but some of the code I am working on is 40+ years old and at the moment I can only get it to compile for Windows with gcc3/g++3/g77. On Linux, it compiles under some versions of gcc4, but not all of them. This app is hundreds of thousands of lines of code; I can add some more modern functionality to it, but there are limits. I don't have 2 years to re-write the whole thing in modern C++ or Java, or something like that, so for now I just have to live with the eccentricities.

    I am looking for a good place in your code to do my check for compatibility. It looks like the check needs to go in cc(), inside the for (i = r; i <= n; ++i) loop. That is where I need to check each element being assigned by arr[r - 1] = i - 1; to make sure it is "compatible" with the other elements. In my code, I had a set of elements in a vector and then added a new element, so it was easy to check the element being added against the elements already present. The efficiency of this was that once 0 and 1 were found to be incompatible, 0 and 1 were never created as a pair again.

    For example, if 0,1 and 2,3 are incompatible, I was trying to do the following,

    Code:
    // create each single
    0
    1
    2
    3
    4
    
    // create each pair
    0 1  // removed as 0,1 incompatible
    0 2
    0 3
    0 4
    1 2
    1 3
    1 4
    2 3 // removed as 2,3 incompatible
    2 4 
    3 4
    
    // trimmed pairs
    0 2
    0 3
    0 4
    1 2
    1 3
    1 4
    3 4
    2 4
    
    // create each triple based on compatible pairs
    // the following triples never get created because the root pairs do not advance
    0 1 2
    0 1 3
    0 1 4
    2 3 4
    
    // when the last digit is added, some new sets are found to be incompatible
    0 2 3 // removed as 2,3 incompatible
    0 2 4
    0 3 4
    1 2 3 // removed as 2,3 incompatible
    1 2 4
    1 3 4
    
    // trimmed triples
    0 2 4
    0 3 4
    1 2 4
    1 3 4
    The algorithm can actually stop here because every quad or higher has either 0,1 or 2,3 in it. The way I detected this was to make a sorted unique list of the last element added at each iteration. Any element not on this list does not need to be added again. After pairs are created, 2,3,4 remain. When making the next larger sized set, the first element of the list, 2, is skipped so when making triples, we are only adding 3 or 4. After triples, only 4 remains and since 4 is already present in all the triples that would be built on, we can stop.

    I don't know if this methodology makes any sense or not. I was looking for shortcuts because of the very large numbers I knew I would be running into. With your code, I think I would be building all of the possible combinations at each iteration. I could still check them to make sure there were no incompatible pairs and not keep them if there are, but I would not be making fewer and fewer combinations as I dropped out pairs and triples, etc.

    Does that make any sense?

    LMHmedchem

  2. #17
    2kaud

    This works well up to N=23 with 8388607 combinations. At N=24, I get a segfault. Running gdb doesn't give me a location for where the segfault occurs. I can't tell if this means that the corruption is in the heap or not.
    The problem here is that all the various combinations are stored in a vector which is then manipulated to produce the required ordering. This is memory heavy: pushing new elements to the end can trigger a reallocation that copies the whole vector, which uses extra memory and is slow.

    I am looking for a good place in your code to do my check for compatibility.
    At first glance, I would have looked at the contents of the vector arr before it is pushed onto the vector - and only push if it is suitable.

    Can you provide an example list of incompatibilities that should be excluded, so that I can do a bit of playing around with the code to see what I can come up with?

    Coming back to a point made in post #6, how important is the ordering of the resultant vector? For large values of N, accepting the ordering from cc() would reduce greatly the memory required and the time taken.
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  3. #18
    wolle

    Quote Originally Posted by LMHmedchem View Post
    If I could determine the number of combinations based on N and the number of incompatible pairs, that might help to make a better estimate. It might make more sense to just monitor the size of the final result vector and have it quit when it is getting too large.

    Suggestions on how to handle this part would be appreciated as always.
    You can save space by storing the sets as bit-sets. Each set will then be an integer in the [1 .. (2^n)-1] range. Also certain set operations on bit-sets are very fast.

    You want the sets in a certain order (lexicographic ordered k-subsets of an n-set where k goes from n down to 1). You say this order will help you to effectively remove invalid sets but why exactly is this order an advantage? What about having the whole power-set of the n-set in lexicographic order? And what's wrong with the natural (forward or reverse) integer order? I can see certain advantages with the natural order:

    Let's say the set members x and y are incompatible and all sets with these members must be excluded. This will result in a certain number of gaps in the natural sequence of integers. Both the location and length of these gaps are known beforehand. Every incompatible pair of set members you introduce will produce new gaps. In between these gaps there are, of course, sequences of valid sets. Each such sequence can be represented with a start set and an end set (just two integers really).

    So when you are at a certain set you can always determine whether you have a stretch of invalid or valid sets in front of you, and you know how long it is. You either jump over it in one leap or iterate through it. You can store all valid sets, but you can also store just the start and end sets of the valid sequences to reduce the memory needed. Or you don't store anything at all and instead generate the valid sets on the fly whenever you need them.

    Another thing to consider: when one ends up with a combinatorial solution that is bordering on the intractable, it's always a good idea to go back to the drawing board and look at the problem from another angle.
    Last edited by wolle; February 23rd, 2017 at 03:27 AM.

  4. #19
    LMHmedchem

    Quote Originally Posted by 2kaud View Post
    Can you provide an example list of incompatibilities that should be excluded so that I can do a bit of playing around with the code to see what I can come up with.
    Below is a modification of the code, based on a specific example I am working on. N=16 and I have added a vector of map which is the compatibility lookup. There is also a compatibility function, conflict_check(); where arr is pushed into the final result vector, conflict_check(arr) is called and arr is only pushed if it returns true. Sorry that the map entries make this look messy. I used a vector of map because that is somewhat similar to how the actual application works, where there is a vector of objects and each object contains a map.

    For this case, only 48 of the possible 65536 combinations survive the conflict test. Unfortunately, all of the surviving subsets are singles or pairs. I will look for an example tomorrow that has some more interesting results.

    Quote Originally Posted by 2kaud View Post
    Coming back to a point made in post #6, how important is the ordering of the resultant vector? For large values of N, accepting the ordering from cc() would reduce greatly the memory required and the time taken.
    The final result vector will be sorted after the fact, so the order produced by the algorithm is not important. The order I was using was based on the implementation I described earlier, where singles were expanded to pairs, etc. It is also easier to manually check the output if it is in a rational order. The other consideration is that if part of the list must be skipped because of memory constraints, it is more important to process lower-value sets than higher-value ones. In other words, 0,1,2 is more important than 13,14,15. For this reason, in my first code I made all the combinations of all sizes that started with 0 and then moved on to all the combinations that started with 1, etc.

    LMHmedchem



    Code:
    #include <iostream>
    #include <vector>
    #include <functional>
    #include <algorithm>
    #include <map>
    using namespace std;
    
    // global lookup for compatibility
    map<unsigned int,bool> tmap;
    vector< map<unsigned int,bool> > compatible_lookup;
    
    
    // loads global lookup map
    void lookup_loader(const unsigned int &input_size) {
    
       // size lookup map to input size; this allows assignment of map to vector
       // position instead of using push_back
       compatible_lookup.resize(input_size);
    
       // load conflict map for vector element 0
       tmap[1] = false; tmap[2] = true; tmap[3] = true; tmap[4] = false; tmap[5] = false;
       tmap[6] = false; tmap[7] = false; tmap[8] = false; tmap[9] = false; tmap[10] = false;
       tmap[11] = false; tmap[12] = false; tmap[13] = true; tmap[14] = false; tmap[15] = false;
       // assign to vector of map, this puts the lookup value for element 0 at position 0
       compatible_lookup[0] = tmap;  tmap.clear();
    
       // load conflict map for vector element 1
       tmap[0] = false; tmap[2] = true; tmap[3] = true; tmap[4] = false; tmap[5] = false;
       tmap[6] = false; tmap[7] = false; tmap[8] = false; tmap[9] = false; tmap[10] = false;
       tmap[11] = false; tmap[12] = false; tmap[13] = true; tmap[14] = false; tmap[15] = false;
       compatible_lookup[1] = tmap;  tmap.clear();
    
       // load conflict map for vector element 2
       tmap[0] = true; tmap[1] = true; tmap[3] = false; tmap[4] = false; tmap[5] = false;
       tmap[6] = false; tmap[7] = false; tmap[8] = false; tmap[9] = false; tmap[10] = false;
       tmap[11] = false; tmap[12] = true; tmap[13] = false; tmap[14] = false; tmap[15] = true;
       compatible_lookup[2] = tmap;  tmap.clear();
    
       // load conflict map for vector element 3
       tmap[0] = true; tmap[1] = true; tmap[2] = false; tmap[4] = false; tmap[5] = false;
       tmap[6] = false; tmap[7] = false; tmap[8] = false; tmap[9] = false; tmap[10] = false;
       tmap[11] = false; tmap[12] = true; tmap[13] = false; tmap[14] = false; tmap[15] = true;
       compatible_lookup[3] = tmap;  tmap.clear();
    
       // load conflict map for vector element 4
       tmap[0] = false; tmap[1] = false; tmap[2] = false; tmap[3] = false; tmap[5] = false;
       tmap[6] = false; tmap[7] = false; tmap[8] = true; tmap[9] = true; tmap[10] = false;
       tmap[11] = false; tmap[12] = false; tmap[13] = true; tmap[14] = false; tmap[15] = false;
       compatible_lookup[4] = tmap;  tmap.clear();
    
       // load conflict map for vector element 5
       tmap[0] = false; tmap[1] = false; tmap[2] = false; tmap[3] = false; tmap[4] = false;
       tmap[6] = false; tmap[7] = false; tmap[8] = true; tmap[9] = true; tmap[10] = false;
       tmap[11] = false; tmap[12] = false; tmap[13] = true; tmap[14] = false; tmap[15] = false;
       compatible_lookup[5] = tmap;  tmap.clear();
    
       // load conflict map for vector element 6
       tmap[0] = false; tmap[1] = false; tmap[2] = false; tmap[3] = false; tmap[4] = false;
       tmap[5] = false; tmap[7] = false; tmap[8] = false; tmap[9] = false; tmap[10] = true;
       tmap[11] = true; tmap[12] = false; tmap[13] = false; tmap[14] = true; tmap[15] = false;
       compatible_lookup[6] = tmap;  tmap.clear();
    
       // load conflict map for vector element 7
       tmap[0] = false; tmap[1] = false; tmap[2] = false;tmap[3] = false; tmap[4] = false;
       tmap[5] = false; tmap[6] = false; tmap[8] = false; tmap[9] = false; tmap[10] = true;
       tmap[11] = true; tmap[12] = false; tmap[13] = false; tmap[14] = true; tmap[15] = false;
       compatible_lookup[7] = tmap;  tmap.clear();
    
       // load conflict map for vector element 8
       tmap[0] = false; tmap[1] = false; tmap[2] = false; tmap[3] = false; tmap[4] = true;
       tmap[5] = true; tmap[6] = false; tmap[7] = false; tmap[9] = false; tmap[10] = false;
       tmap[11] = false; tmap[12] = true; tmap[13] = false; tmap[14] = false; tmap[15] = true;
       compatible_lookup[8] = tmap;  tmap.clear();
    
       // load conflict map for vector element 9
       tmap[0] = false; tmap[1] = false; tmap[2] = false; tmap[3] = false; tmap[4] = true;
       tmap[5] = true; tmap[6] = false; tmap[7] = false; tmap[8] = false; tmap[10] = false;
       tmap[11] = false; tmap[12] = true; tmap[13] = false; tmap[14] = false; tmap[15] = true;
       compatible_lookup[9] = tmap;  tmap.clear();
    
       // load conflict map for vector element 10
       tmap[0] = false; tmap[1] = false; tmap[2] = false; tmap[3] = false; tmap[4] = false;
       tmap[5] = false; tmap[6] = true; tmap[7] = true; tmap[8] = false; tmap[9] = false;
       tmap[11] = false; tmap[12] = true; tmap[13] = true; tmap[14] = false; tmap[15] = true;
       compatible_lookup[10] = tmap;  tmap.clear();
    
       // load conflict map for vector element 11
       tmap[0] = false ;tmap[1] = false; tmap[2] = false; tmap[3] = false; tmap[4] = false;
       tmap[5] = false; tmap[6] = true; tmap[7] = true; tmap[8] = false; tmap[9] = false;
       tmap[10] = false; tmap[12] = true; tmap[13] = true; tmap[14] = false; tmap[15] = true;
       compatible_lookup[11] = tmap;  tmap.clear();
    
       // load conflict map for vector element 12
       tmap[0] = false; tmap[1] = false; tmap[2] = true; tmap[3] = true; tmap[4] = false;
       tmap[5] = false; tmap[6] = false; tmap[7] = false; tmap[8] = true; tmap[9] = true;
       tmap[10] = true; tmap[11] = true; tmap[13] = false; tmap[14] = false; tmap[15] = false;
       compatible_lookup[12] = tmap;  tmap.clear();
    
       // load conflict map for vector element 13
       tmap[0] = true; tmap[1] = true; tmap[2] = false; tmap[3] = false; tmap[4] = true;
       tmap[5] = true; tmap[6] = false; tmap[7] = false; tmap[8] = false; tmap[9] = false;
       tmap[10] = true; tmap[11] = true; tmap[12] = false; tmap[14] = false; tmap[15] = false;
       compatible_lookup[13] = tmap;  tmap.clear();
    
       // load conflict map for vector element 14
       tmap[0] = false; tmap[1] = false; tmap[2] = false; tmap[3] = false; tmap[4] = false;
       tmap[5] = false; tmap[6] = true; tmap[7] = true; tmap[8] = false; tmap[9] = false;
       tmap[10] = false; tmap[11] = false; tmap[12] = false; tmap[13] = false; tmap[15] = false;
       compatible_lookup[14] = tmap;  tmap.clear();
    
       // load conflict map for vector element 15
       tmap[0] = false; tmap[1] = false; tmap[2] = true; tmap[3] = true; tmap[4] = false;
       tmap[5] = false; tmap[6] = false; tmap[7] = false; tmap[8] = true;tmap[9] = true;
       tmap[10] = true; tmap[11] = true; tmap[12] = false; tmap[13] = false; tmap[14] = false;
       compatible_lookup[15] = tmap;  tmap.clear();
    
    return;
    }
    
    
    // accepts a vector of int and compares each element to each other element
    // to check for compatibility, returns false if any pair of elements are incompatible
    bool conflict_check(vector<unsigned int> &arr) {
    
       // return value, true by default
       bool pass_check = true;
    
       // loop iterator
       unsigned int i, j;
    
       unsigned int vec_position, map_key;
    
       // if there is only one element in arr, no need to check
       if(arr.size() == 1) { return pass_check; }
       // if there are at least two elements in arr
       else {
          // check each element in arr against each other element
          for(i=arr.size(); i-- >0;) {
             // position of map in vector
             vec_position = arr[i];
             for(j=0; j<i; j++) {
                // key value
                map_key = arr[j];
                // if any incompatible pair is found, return false, no need to keep checking
                if(! compatible_lookup[vec_position][map_key]) {
                   pass_check = false;
                   return pass_check;
                }
             }
          }
       }
    
    // if code reaches here, everything is compatible, return default value
    return pass_check;
    }
    
    
    // receives two ints, a vector<int>, and a vector<vector<int>>
    // populates the vector<vector<int>> with each combination built up in the vector<int>
    void cc(unsigned int n, unsigned int r, vector<unsigned int>& arr, vector< vector<unsigned int> >& vvi) {
    
       // loop iterator
       unsigned int i;
    
       // loop from r to total_sets
       for (i = r; i <= n; ++i) {
    
          arr[r - 1] = i - 1;
    
          if (r > 1) { cc(i - 1, r - 1, arr, vvi); }
          // call function to check vector for conflicts
          // if no conflict found, add to final result vector
          else {
             if( conflict_check(arr) ) { vvi.push_back(arr); }
          }
       }
    
    return;
    }
    
    
    // declares a vector<int> of size r and passes it to cc
    void comb(unsigned int n, unsigned int r, vector < vector<unsigned int> >& vvi) {
    
       vector<unsigned int> arr(r);
    
       cc(n, r, arr, vvi);
    
    return;
    }
    
    
    // sorts sv, appends its contents to nv and then clears sv
    void sr(vector< vector<unsigned int> >& sv, vector< vector<unsigned int> >& nv) {
       
       sort(sv.begin(), sv.end());
    
       // loop iterator
       vector< vector<unsigned int> >::iterator a;
    
       // loops and adds value at pointer
       for (a = sv.begin(); a != sv.end(); ++a) { nv.push_back(*a); }
    
       sv.clear();
    
    return;
    }
    
    
    void comb1(unsigned int N, vector< vector<unsigned int> >& nv) {
    
       vector< vector<unsigned int> > vvi;
       vector< vector<unsigned int> > sv;
    
       // loop iterator
      unsigned int m;
    
       // loop from 1 to input size, calling comb() for each subset size
       for(m = 1; m <= N; ++m) { comb(N, m, vvi); }
    
       size_t sz = vvi.rbegin()->size();
    
       // loop iterator
       vector< vector<unsigned int> >::iterator vv;
    
       for (vv = vvi.begin(); vv != vvi.end(); ++vv) {
          if (vv->size() == sz) { sv.push_back(*vv); }
          else {
             sr(sv, nv);
             sz = vv->size();
             sv.push_back(*vv);
          }
       }
    
       sr(sv, nv);
    
    return;
    }
    
    
    int main() {
    
       // size of set being processed
       const unsigned int N = 16;
    
       // final result vector
       vector< vector<unsigned int> > nv;
    
       // load global lookup map
       lookup_loader(N);
    
       // call function to generate all unique combinations of N
       comb1(N, nv);
    
       // loop iterators
       // a iterates over vector<vector<unsigned int>>
       vector< vector<unsigned int> >::iterator a;
       // b points to a, which is a specific vector<unsigned int> in nv
       vector<unsigned int>::iterator b;
    
       // loop over result vector<vector<unsigned int>> and print
       for(a = nv.begin(); a != nv.end(); ++a) {
          for(b = a->begin(); b != a->end(); ++b) {
             cout << *b << " ";
          }
          cout << endl;
       }
    
       // print final size
       cout << endl;
       cout << nv.size() << " combinations" << endl;
    
    return 0;
    }
    Last edited by 2kaud; February 23rd, 2017 at 03:33 AM.

  5. #20
    2kaud

    Sorry that the map entries make this look messy.
    Having a vector of map wouldn't have been my first choice for a data structure!

    Also note that if N is set to be less than 16, then there are big problems! Consider
    Code:
    	compatible_lookup.resize(std::max(16U, input_size));
    For this case, only 48 of the possible 65536 combinations survive the conflict test
    If this ratio of reduction is repeated for higher numbers of N, then it is unlikely that memory will be an issue for the vector(s).

    For the map look-up, is there a reason why some elements are missing - different ones for different vector entries? e.g. for vector 1 there is tmap[0] = false; tmap[2] = true; but no tmap[1]?

    Note that this could result in elements being inadvertently added to the map. Consider
    Code:
        data1 = tmap[0];
        data2 = tmap[1];
    For vector 1, tmap[0] exists, so data1 will be the stored data. But tmap[1] doesn't exist, so that entry will be created in the map with the default value appropriate for the type (in this case false).

    This same behaviour happens even when the map is itself an element of a vector, e.g. compatible_lookup[vec_position][map_key], where if the entry with key map_key doesn't exist then one will be created.
    Last edited by 2kaud; February 23rd, 2017 at 11:04 AM.

  6. #21
    LMHmedchem

    Quote Originally Posted by 2kaud View Post
    Having a vector of map wouldn't have been my first choice for a data structure!
    I should have been clearer that this vector of map is just a temporary hard-coded example to provide the conflict data required to do a compatibility check. In the application I am working on, the integer values 0-N represent objects in a vector of objects. These have been sorted on a class member value so that the most important object is stored as element 0 in the vector. There is nothing like that messy map in my actual code. You asked for an example of conflicts, so these are the actual conflicts from an N=16 example that I am looking at. I put the conflicts in individual maps because they are stored and addressed from maps in the actual program.

    Quote Originally Posted by 2kaud View Post
    Also note that if N is set to be less than 16, then there are big problems! Consider
    Code:
    compatible_lookup.resize(std::max(16U, input_size));
    In the application I am working on, each object contains a map of compatibility to each other object. The size of each map is managed as the objects are created with one map entry for each object. In the case I used for this example, object 0 contains a map with entries for compatibility with objects 1-15. Object 1 contains a map with entries for compatibility with object 0, and 2-15, etc. This is the reason for the missing entries you noticed. I didn't think it necessary to have an entry for object 0 in the object 0 map since we will never check the compatibility of 0 against itself. If we did (by accident), the default constructed value of false would be correct.

    This is meant to be a half triangle matrix where the diagonal is each element compared to itself and is ignored. I didn't know I was going to set things up this way in the beginning, and since the objects already existed, I thought that was a logical place to store the compatibility data where it could be accessed simply by knowing the element values you wanted to compare.

    Code:
    // vec_position is one element number and map_key is the other 
    compatible_lookup[vec_position][map_key]
    Quote Originally Posted by 2kaud View Post
    This same behavior happens even if a map is the type of a vector. eg compatible_lookup[vec_position][map_key]
    where if the map with key map_key doesn't exist then one will be created.
    This should never happen, since the conflict between each object pair in the vector of objects has been tested and the results stored in the map in each object before any of the map data is read and used to make decisions. As a disclaimer, that is what I intended in the design, and unfortunately implementation does not always perfectly reflect design. If this is a bad design that could be safer, I am definitely listening. Since I am trying to keep this as fast as possible, I decided to assign and evaluate the map entries directly instead of using map::insert() and map::find().

    Quote Originally Posted by 2kaud View Post
    If this ratio of reduction is repeated for higher numbers of N, then it is unlikely that memory will be an issue for the vector(s).
    This is partly what I am hoping, but I do have examples that are crashing. The example I have where N=40 blows up, so the conflict reduction is not enough.

    Last night, I replaced my algorithm with this new one (including the conflict check). I ran an input file that my code processed in 8 minutes. This new code took 36 minutes to process the same file. It would appear that the concept of building triples on surviving pairs has some performance potential. The N=40 example crashes my code as well. One of the problems with my algorithm is that it creates duplicates that have to be removed. For the hard coded example above, my code finds 80 combinations which are then reduced to the proper 48 with a duplicate removal function. There is no reason why duplicates should be created for something like this, so that was one thing I was trying to fix. This would speed things up more, lighten memory requirements, and also eliminate the need for the duplicate removal function which takes even more time. Is there a way to combine the set generation algorithm you suggest with the incremental building method I used? Since I may not be able to process everyting, it makes the most sense to build all the sets with root 0 first,
    Code:
    0
    0,1
    0,2
    0,3
    0,1,2
    0,1,3
    0,2,3
    0,1,2,3
    followed by the sets with root 1, etc. This would mean that if anything was left out because of memory constraints, it would be the combinations of the higher number elements at the end which are less important.

    Quote Originally Posted by wolle View Post
    You can save space by storing the sets as bit-sets. Each set will then be an integer in the [1 .. (2^n)-1] range. Also certain set operations on bit-sets are very fast.
    Sorry that I missed your post while I was responding. I ran across some posts referencing bit algorithms for APS but I have never done anything like that so they were a bit hard to follow.

    Quote Originally Posted by wolle View Post
    You want the sets in a certain order (lexicographic ordered k-subsets of an n-set where k goes from n down to 1). You say this order will help you to effectively remove invalid sets but why exactly is this order an advantage? What about having the whole power-set of the n-set in lexicographic order? And what's wrong with the natural (forward or reverse) integer order? I can see certain advantages with the natural order:
    In my thinking, the advantage is not in removing incompatible sets but in not building them in the first place. I build 0,1 by adding 1 to 0, first testing the compatibility. If 0,1 is incompatible, 0,1 does not move forward. Triples are built by adding numbers to surviving doubles. If 0,1 survives, then 0,1,2 and 0,1,3 etc. will be built, and later 0,1,2,3 will be built. If 0,1 does not survive, then nothing else starting with 0,1 will ever be built, because 0,1 will not be contained in the vector on which triples are based. Am I explaining that correctly? The only way I can see to implement that is to build the subsets in a particular order so that the elements in the subset are always in increasing order; you are never adding an element of equal or lower value to an element already in the set. I have no idea if this method makes the most sense, and I generally suspect that my ideas do not make the most sense, so that is why I am posting here.

    Quote Originally Posted by wolle View Post
    Let's say the set members x and y are incompatible and all sets with these members must be excluded. This will result in a certain number of gaps in the natural sequence of integers. Both the location and length of these gaps are known beforehand. Every incompatible pair of set members you introduce will produce new gaps. In between these gaps there are, of course, sequences of valid sets. Each such sequence can be represented with a start set and an end set (just two integers really).

    So when you are at a certain set you can always determine whether you have a stretch of invalid or valid sets in front of you, and you know how long it is. You either jump over it in one leap or iterate through it. You can store all valid sets, but you can also store just the start and end sets of the valid sequences to reduce the memory needed. Or you don't store anything at all and instead generate the valid sets on the fly whenever you need them.
    This is an interesting method, but I have never implemented anything like it. Are there some code examples you could point me to?

    Quote Originally Posted by wolle View Post
    Another thing to consider is that when one ends up with a combinatorial solution that is bordering to the intractable it's always a good idea to go back to the drawing board and look at the problem from another angle .
    Yes, this is very well said and this is exactly why I am here. My original code processed the file I am working with in 23 minutes and I was able to get that down to 8. There are several rows that fail, I think because the number N is too large. I am hoping to get the processing time below 8 minutes and be able to process all rows. If I can't do all the rows because of memory constraints, I would like the program to fail more gracefully than with a segfault and to still get the most reasonable output I can, even if I can't completely finish processing the row.

    LMHmedchem

  7. #22
    2kaud's Avatar
    2kaud is offline Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,825

    Re: a variation on all possible subsets

    Since I may not be able to process everything, it makes the most sense to build all the sets with root 0 first,
    Code:
    :
    
    0
    0,1
    0,2
    0,3
    0,1,2
    0,1,3
    0,2,3
    0,1,2,3
    That won't be straightforward, as the current code loops generating nCr where r varies from 1 to n and then processes the resulting vector to provide the required output format. Generating the data in this format initially is going to require an algorithm change. I'll have a think about it over the weekend.

    I would like the program to fail more gracefully than with a segfault and to still get the most reasonable output I can, even if I can't completely finish processing the row
    That should be possible using exceptions. I'll check it out using VS.
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  8. #23
    2kaud's Avatar
    2kaud is offline Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,825

    Re: a variation on all possible subsets

    I would like the program to fail more gracefully than with a segfault and to still get the most reasonable output I can, even if I can't completely finish processing the row
    The following should catch that. It works in VS2015. In main
    Code:
    	try {
    		comb1(N, nv);
    	}
    	catch (...)
    	{
    		cout << "Bad exception" << endl;
    	}
    As memory is an issue and you stated you're going to sort the vector later, comb1() can be reduced to
    Code:
    void comb1(unsigned int N, vector< vector<unsigned int> >& nv) {
    
    	unsigned int m;
    
    	// loop from 1 to input size and call 
    	for (m = 1; m <= N; ++m) {
    		comb(N, m, nv);
    	}
    }
    With this revised comb1() code, the maximum number of allowed combinations in the vector on my system is 26,906,977!

    PS If you don't actually need the combinations to be in a vector of vector, you could use a list of vector. On my system in this case the maximum increases to 27,660,222. Consider
    Code:
    void cc(unsigned int n, unsigned int r, vector<unsigned int>& arr, list< vector<unsigned int> >& vvi) {
    
    	unsigned int i;
    
    	for (i = r; i <= n; ++i) {
    		arr[r - 1] = i - 1;
    		if (r > 1) {
    			cc(i - 1, r - 1, arr, vvi);
    		}
    		else {
    			//if (conflict_check(arr)) {
    				vvi.push_back(arr);
    			//}
    		}
    	}
    
    	return;
    }
    
    
    // declares a vector<int> of size r and passes it to cc
    void comb(unsigned int n, unsigned int r, list < vector<unsigned int> >& vvi) {
    
    	vector<unsigned int> arr(r);
    
    	cc(n, r, arr, vvi);
    
    	return;
    }
    
    
    void comb1(unsigned int N, list< vector<unsigned int> >& nv) {
    
    	unsigned int m;
    
    	// loop from 1 to input size and call 
    	for (m = 1; m <= N; ++m) {
    		comb(N, m, nv);
    	}
    }
    
    int main()
    {
    	// size of set being processed
    	const unsigned int N = 31;
    
    	list< vector<unsigned int> > nv;
    
    	//lookup_loader(N);
    
    	try {
    		comb1(N, nv);
    	}
    	catch (...)
    	{
    		cout << "Bad exception" << endl;
    	}
    
    	list< vector<unsigned int> >::iterator a;
    	vector<unsigned int>::iterator b;
    
    	/*
    	for (a = nv.begin(); a != nv.end(); ++a) {
    		for (b = a->begin(); b != a->end(); ++b) {
    			cout << *b << " ";
    		}
    		cout << endl;
    	}
    	*/
    	cout << endl;
    	cout << nv.size() << " combinations" << endl;
    
    	return 0;
    }
    PPS Do you actually need a data structure populated with the combination numbers at all? rather than cc() pushing the vector onto a container, what about calling a call-back function specified with say comb1() that is called instead? Then in the proper program each required combination can be processed without first saving them to a container - faster and no memory issues! Just a thought
    Last edited by 2kaud; February 23rd, 2017 at 02:53 PM. Reason: PPS

  9. #24
    2kaud's Avatar
    2kaud is offline Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,825

    Re: a variation on all possible subsets

    Is this the ordering you want?

    Code:
    0
    0 1
    0 1 2
    0 1 2 3
    0 1 2 3 4
    0 1 2 4
    0 1 3
    0 1 3 4
    0 1 4
    0 2
    0 2 3
    0 2 3 4
    0 2 4
    0 3
    0 3 4
    0 4
    1
    1 2
    1 2 3
    1 2 3 4
    1 2 4
    1 3
    1 3 4
    1 4
    2
    2 3
    2 3 4
    2 4
    3
    3 4
    4
    
    31 combinations
    N = 5 with no conflict check for testing.

    Consider this code
    Code:
    void cc(unsigned int n, unsigned int r, vector<unsigned int>& arr, set< vector<unsigned int> >& vvi) {
    
    	unsigned int i;
    
    	for (i = r; i <= n; ++i) {
    		arr[r - 1] = i - 1;
    		if (r > 1) {
    			cc(i - 1, r - 1, arr, vvi);
    		}
    		else {
    			if (conflict_check(arr)) {
    				vvi.insert(arr);
    			}
    		}
    	}
    
    	return;
    }
    
    void comb(unsigned int N, set< vector<unsigned int> >& nv) {
    
    	unsigned int m;
    
    	// loop from 1 to input size and call 
    	for (m = 1; m <= N; ++m) {
    		vector<unsigned int> arr(m);
    		cc(N, m, arr, nv);
    	}
    }
    
    int main()
    {
    	// size of set being processed
    	const unsigned int N = 16;
    
    	set< vector<unsigned int> > nv;
    
    	lookup_loader(N);
    
    	try {
    		comb(N, nv);
    	}
    	catch (...)
    	{
    		cout << "Bad exception - Elements: " << nv.size() << endl;
    	}
    
    	set< vector<unsigned int> >::const_iterator a;
    	vector<unsigned int>::const_iterator b;
    	
    	for (a = nv.begin(); a != nv.end(); ++a) {
    		for (b = a->begin(); b != a->end(); ++b) {
    			cout << *b << " ";
    		}
    		cout << endl;
    	}
    	
    	cout << endl;
    	cout << nv.size() << " combinations" << endl;
    
    	return 0;
    }
    It stores the combinations in a set of vector - which gives the required sorted order.

    With N = 16 and conflict checking used, the output is
    Code:
    0
    0 2
    0 3
    0 13
    1
    1 2
    1 3
    1 13
    2
    2 12
    2 15
    3
    3 12
    3 15
    4
    4 8
    4 9
    4 13
    5
    5 8
    5 9
    5 13
    6
    6 10
    6 11
    6 14
    7
    7 10
    7 11
    7 14
    8
    8 12
    8 15
    9
    9 12
    9 15
    10
    10 12
    10 13
    10 15
    11
    11 12
    11 13
    11 15
    12
    13
    14
    15
    
    48 combinations
    Last edited by 2kaud; February 24th, 2017 at 05:45 AM.

  10. #25
    Join Date
    Feb 2017
    Posts
    677

    Re: a variation on all possible subsets

    Quote Originally Posted by LMHmedchem View Post
    The only way I can see to implement that is to build the subsets in a particular order so that the elements in the subset are always in increasing order.
    That would be when the whole power-set of the n-set is presented in lexicographic order. 2kaud has posted an example in #24. It's probably the best order with your current approach.

    This is an interesting method but I have never implemented anything like that. Are there some code examples you could point met to.
    Do you mean the use of bit-sets? Well, there isn't much to it. It's just to interpret the bit pattern of an int as a set. When a bit at a certain position is 1 it means the corresponding set member is present in the set. The bit-pattern [1000101] read from right to left for example would mean the set {0,2,6}. If it's read in the other direction it would mean {0,4,6} instead. Bit-wise operations on integers can be a very fast way to perform set operations.

    In the bit-pattern algorithm I suggested, the power-set of the n-set is represented as integers in the [1 .. (2^n)-1] range in natural order, either ascending or descending, whichever is most convenient. The excluded sets then appear as gaps in that range, and those gaps are known in advance. So the algorithm keeps track of stretches of bit-sets/integers you either jump over or muddle through.

    Well, I haven't actually tested, but it should work.
    Last edited by wolle; February 24th, 2017 at 09:39 AM.

  11. #26
    Join Date
    Oct 2008
    Posts
    1,456

    Re: a variation on all possible subsets

    so, if I followed the thread correctly, you want to enumerate all combinations of k elements in [0,n-1], for all 1<=k<=n, satisfying some arbitrary pairwise compatibility criterion, right?

    unless I'm missing something, a recursive c++03 off-the-top-of-my-head solution could be

    Code:
    #include <vector>
    #include <algorithm>
    #include <cassert>
    #include <iostream>
    
    enum bit_t { zero, one, indeterminate };
    typedef std::vector<bit_t>      bit_vec_t;
    typedef std::vector<bit_vec_t>  bit_stack_t;
    typedef std::vector<bool>       bool_vec_t;
    typedef std::vector<bool_vec_t> bool_map_t;
    
    template < typename F >
    void for_each_combination_impl( bit_stack_t& stack, std::size_t id, bool_map_t const& conflicts, F const& fun )
    {
        typedef bit_vec_t::iterator iterator_t;
        
        assert( id && id <= stack.size() );
        
        if( id == stack.size() )
        {
            fun( stack[0] );
        }
        else if( stack[0][id] == indeterminate )
        {
            iterator_t next_on_top = stack[0].begin() + id + 1;
            iterator_t next_on_stack = stack[id].begin() + id + 1;
            
            stack[0][id] = one;
            std::copy( next_on_top, stack[0].end(), next_on_stack );
            for( std::size_t c = id + 1; c < stack.size(); ++c ) if( conflicts[id][c] ) stack[0][c] = zero;
            for_each_combination_impl( stack, id + 1, conflicts, fun );
            
            stack[0][id] = zero;
            std::copy( next_on_stack, stack[id].end(), next_on_top );
            for_each_combination_impl( stack, id + 1, conflicts, fun );
        }
        else
            for_each_combination_impl( stack, id + 1, conflicts, fun );
    }
    
    template < typename F >
    void for_each_combination( bool_map_t const& conflicts, F fun )
    {
        bit_stack_t stack( conflicts.size(), bit_vec_t( conflicts.size(), indeterminate ) );
        
        stack[0][0] = one;
        for( std::size_t c = 1; c < stack.size(); ++c ) if( conflicts[0][c] ) stack[0][c] = zero;
        for_each_combination_impl( stack, 1, conflicts, fun );
        
        stack[0][0] = zero;
        std::fill( stack[0].begin() + 1, stack[0].end(), indeterminate );
        for_each_combination_impl( stack, 1, conflicts, fun );
    }
    
    void print( bit_vec_t const& elements )
    {
        for( std::size_t c = 0; c < elements.size(); ++c )
        {
            assert( elements[c] != indeterminate );
            
            if( elements[c] ) std::cout << c << ' ';
        }
        std::cout << '\n';
    }
    
    int main()
    {
        std::size_t size = 5;
        bool_map_t  conflicts( size, bool_vec_t( size, false ) );
        
        // following post #16 ( note that i<j pairs suffice )
        conflicts[0][1] = true;
        conflicts[2][3] = true;
        
        for_each_combination( conflicts, print );
    }
    which outputs

    Code:
    0 2 4 
    0 2 
    0 3 4 
    0 3 
    0 4 
    0 
    1 2 4 
    1 2 
    1 3 4 
    1 3 
    1 4 
    1 
    2 4 
    2 
    3 4 
    3 
    4
    obvious optimizations aside, this is optimal in the sense that the full power set is not traversed anymore.
    it could be further optimized by analysing the conflicts map ( indeed, consider the map as an undirected graph: the enumerations are the cartesian product of the enumerations of each connected component, with each transitively closed component having exactly 1+<number of vertices> enumerations, ... )
    Last edited by superbonzo; February 24th, 2017 at 10:37 AM. Reason: cosmetics

  12. #27
    Join Date
    Oct 2008
    Posts
    1,456

    Re: a variation on all possible subsets

    Quote Originally Posted by superbonzo View Post
    indeed, consider the map as an undirected graph, the enumerations are the cartesian product of the enumerations of each connected component, with each transitively closed component having exactly 1+<number of vertices> enumerations, ...
    more specifically, the number of enumerations N(G) given the conflict graph G is given by the following recursive relation ( in pseudocode )

    Code:
    N(G)
        C=1;
        decompose G into connected components G1,...,GN
        for each Gi
            if Gi is transitively closed, C *= 1 + count(Gi)
            otherwise, given the first vertex v of Gi
                C *= ( N( G - {v} ) + N( G - {v} - { all vertices connected to v } ) )
        return C
    you may use this to see if a given input is tractable or not
    moreover, by following the above you can build a tree of subgraphs, by virtue of which the traversal can be further optimized ...

  13. #28
    Join Date
    May 2009
    Location
    Boston
    Posts
    364

    Re: a variation on all possible subsets

    Over the weekend I re-did my algorithm to get it working the way I thought it should. This was before some of the more recent posts, so I apologize that those ideas have not been tested yet. I wanted to get my current code working to see how well it did before comparing it to other approaches. I will also test the suggestions of superbonzo, 2kaud, and wolle for comparison. I am working on a data set with 358 sets, with N values ranging from 2 to 40. With my original code, this set took 23 minutes to run. I managed to get that down to 8 minutes, but it still wasn't working quite right (thus this thread). The code suggested by 2kaud in post #13, which creates all of the possible subsets in sequence and then checks them, takes 36 minutes to process the same data, and about 30 of the 358 sets segfault because the 2^N problem is overrunning something in the library code. That is to be expected, because 1,099,511,627,776 combinations (2^40) is rather more than can be addressed in 32 bits.

    At any rate, I got the algorithm I had in mind working properly and I can now process the same data set in 23 seconds. The optimization comes from not creating the vast majority of the subsets. Branches with incompatible pairs are eliminated early before they exponentially branch out. For the 2^40 case, only 3,708,590 of the theoretical 1,099,511,627,776 combinations are examined and only 178,211 are in the final output.

    This is a synopsis of how the code works based on the example of N=5, {0,1,2,3,4}

    1. each element i in the set N forms a branch where Ni is the first element in each subset. This is called the root_element, and the first root_element will be element 0, followed by element 1, etc.

    2. the root_element is added to the final_subsets[] container by itself
    Code:
    0
    3. the root_element forms the base of N-1 pairs which form the first level of the branch. These are created by adding N-1 vectors to a vec<vec<int>> called root_elements[] where each vector contains just the root_element.
    Code:
    vector< vector<int> > root_elements;
    0
    0
    0
    0
    4. a vector<int> called elements_to_add[] is created containing all the elements in N greater than the root_element
    Code:
    vector<int> elements_to_add
    {1,2,3,4}
    5. pair subsets are generated by adding an element from elements_to_add[] to an existing vector in root_elements[]. Before an element from elements_to_add[] is added, a check function is called to see if the added element is compatible with the existing element. If the elements are compatible, the element is added and the expanded vector is copied into a result vec<vec<int>> called extended_sets[]. Each element that was successfully added to a vector is saved in a vector<int> called last_added_elements[].
    Code:
    // if all checks passed, then all 4 possible subsets would be created in extended_sets
    vector< vector<int> > extended_sets;
    0,1
    0,2
    0,3
    0,4
    
    vector<int> last_added_elements;
    {1,2,3,4}
    6. root_elements[] and elements_to_add[] are both cleared.

    7. the resulting extended_sets[] container is copied into the final_subsets[] container.

    8. the extended_sets[] container is copied to become the root_elements[] set for the next iteration.

    9. the last_added_elements[] vector is copied to become the elements_to_add[] set for the next iteration.

    10. extended_sets[] and last_added_elements[] are both cleared.

    11. the next iteration forms the next level of the branch by using the pairs in root_elements[] to build triples, adding each element in elements_to_add[] to each element of root_elements[]. Before each element is added, it is tested to make sure that it is not already in the set being extended and is not smaller than the last element of that set.
    Code:
    root_elements[]
    0,1
    0,2
    0,3
    0,4
     elements_to_add[]
    {1,2,3,4}
    
    // check the following
    0,1 add 1 // fails 1 is already a member of 0,1
    0,1 add 2
    0,1 add 3
    0,1 add 4
    
    0,2 add 1 // fails 1 is smaller than 2
    0,2 add 2 // fails 2 is already a member of 0,2
    0,2 add 3
    0,2 add 4
    
    0,3 add 1 // fails 1 is smaller than 3
    0,3 add 2 // fails 2 is smaller than 3
    0,3 add 3 // fails 3 is already a member of 0,3
    0,3 add 4
    
    0,4 add 1 // fails 1 is smaller than 4
    0,4 add 2 // fails 2 is smaller than 4
    0,4 add 3 // fails 3 is smaller than 4
    0,4 add 4 // fails 4 is already a member of 0,4
    
    // passing triples are added to extended_sets[]
    extended_sets[]
    0,1,2
    0,1,3
    0,1,4
    0,2,3
    0,2,4
    0,3,4
    
    // added elements are sorted and duplicates removed with unique()
    last_added_elements[]
    {2,3,4}
    12. If the addition of an element passes the first two checks, the element is also checked for compatibility with all elements already in the subset using the check function. Passing sets are added to extended_sets[] to form the basis for creating quads.

    13. elements that were successfully added are saved in last_added_elements[] which is sorted and duplicates are removed with unique(). Note that 1 is no longer in last_added_elements[] because it was not added to any expanded subset.

    The loop continues until there are no elements left in last_added_elements[] to use to make expanded subsets.

    If there are no incompatible elements, the full output for the element 0 branch is the following 16 subsets,
    Code:
    final_subsets[]
    0,
    0,1
    0,2
    0,3
    0,4
    0,1,2
    0,1,3
    0,1,4
    0,2,3
    0,2,4
    0,3,4
    0,1,2,3
    0,1,2,4
    0,1,3,4
    0,2,3,4
    0,1,2,3,4
    This is certainly not the most efficient way to generate these subsets, but there are advantages when N is large and there are incompatible elements. This post is getting too long, so I will show these advantages and post my code in the next post.

    LMHmedchem

    Last edited by LMHmedchem; February 28th, 2017 at 01:07 AM.

  14. #29
    2kaud's Avatar
    2kaud is offline Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,825

    Re: a variation on all possible subsets

    At any rate, I got the algorithm I had in mind working properly and I can now process the same data set in 23 seconds
    That's great!

  15. #30
    Join Date
    Feb 2017
    Posts
    677

    Re: a variation on all possible subsets

    Quote Originally Posted by LMHmedchem View Post
    At any rate, I got the algorithm I had in mind working properly and I can now process the same data set in 23 seconds. The optimization comes from not creating the vast majority of the subsets. Branches with incompatible pairs are eliminated early before they exponentially branch out. For the 2^40 case, only 3,708,590 of the theoretical 1,099,511,627,776 combinations are examined and only 178,211 are in the final output.
    That should be fast enough for all practical purposes. Still, I've done some "recreational" programming to check the feasibility of my suggestion to treat the sets as bit-sets.

    Using bit-sets means a set has a dual representation, both as a bit pattern and as a natural number (an integer). This is output from my program with the same data as superbonzo used in #26. The bit-pattern representation [] is on the left, then follows the set representation {}, and finally the integer representation () on the right:

    Code:
    *** skipsets ***
    [11000] - {0,1} - (24)
    [00110] - {2,3} - (6)
    *** trace ***
    skip: [11111] - {0,1,2,3,4} - (31)
    skip: [11110] - {0,1,2,3} - (30)
    skip: [11101] - {0,1,2,4} - (29)
    skip: [11100] - {0,1,2} - (28)
    skip: [11011] - {0,1,3,4} - (27)
    skip: [11010] - {0,1,3} - (26)
    skip: [11001] - {0,1,4} - (25)
    skip: [11000] - {0,1} - (24)
    skip: [10111] - {0,2,3,4} - (23)
    skip: [10110] - {0,2,3} - (22)
          [10101] - {0,2,4} - (21)
          [10100] - {0,2} - (20)
          [10011] - {0,3,4} - (19)
          [10010] - {0,3} - (18)
          [10001] - {0,4} - (17)
          [10000] - {0} - (16)
    skip: [01111] - {1,2,3,4} - (15)
    skip: [01110] - {1,2,3} - (14)
          [01101] - {1,2,4} - (13)
          [01100] - {1,2} - (12)
          [01011] - {1,3,4} - (11)
          [01010] - {1,3} - (10)
          [01001] - {1,4} - (9)
          [01000] - {1} - (8)
    skip: [00111] - {2,3,4} - (7)
    skip: [00110] - {2,3} - (6)
          [00101] - {2,4} - (5)
          [00100] - {2} - (4)
          [00011] - {3,4} - (3)
          [00010] - {3} - (2)
          [00001] - {4} - (1)
    *** result ***
    [10101] - {0,2,4} - (21)
    [10100] - {0,2} - (20)
    [10011] - {0,3,4} - (19)
    [10010] - {0,3} - (18)
    [10001] - {0,4} - (17)
    [10000] - {0} - (16)
    [01101] - {1,2,4} - (13)
    [01100] - {1,2} - (12)
    [01011] - {1,3,4} - (11)
    [01010] - {1,3} - (10)
    [01001] - {1,4} - (9)
    [01000] - {1} - (8)
    [00101] - {2,4} - (5)
    [00100] - {2} - (4)
    [00011] - {3,4} - (3)
    [00010] - {3} - (2)
    [00001] - {4} - (1)
    Interestingly, the result came out in the same order as superbonzo's.

    To be continued in a second post.
    Last edited by wolle; March 1st, 2017 at 01:20 AM.
