
  1. #1
    Join Date
    Jul 2009
    Posts
    9

    Question std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Hi-

    I'm experiencing a weird problem that makes me wonder if there is a problem with the std::list.sort member function when used on very large lists. I have a very large list (~20GB) made up of a class (382 bytes each) of DNA sequence data. When the list is on the order of 12GB it sorts fine, but once it gets up into the 17GB range, calling sort() seems to block. The list has already been built, so I don't think it's a malloc issue. I also don't think it's a complexity problem, because the 12GB list sorts fine (it takes a minute or so), so I'm wondering whether the std sort method doesn't properly handle the extra memory available in 64-bit mode. Has anyone experienced this, or does anyone know a good place to ask this question?

    -thanks
    jason

  2. #2
    Join Date
    Apr 1999
    Posts
    27,449

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Quote Originally Posted by evilxsystems View Post
    Hi-

    I'm experiencing a weird problem that makes me wonder if there is a problem with the std::list.sort member function when used on very large lists. I have a very large list (~20GB) made up of a class (382 bytes each) of DNA sequence data. When the list is on the order of 12GB it sorts fine, but once it gets up into the 17GB range, calling sort() seems to block.
    What is your sorting criterion? What does it look like (in code)?

    Problems with sorts can stem from writing incorrect sorting predicates. If you wrote one, please post it.

    Regards,

    Paul McKenzie

  3. #3
    Join Date
    Jul 2009
    Posts
    9

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Code:
    bool sortFitRec0(const fitnessrecord& low, const fitnessrecord& high){
      return low.hash[0] > high.hash[0];
    }//end sortFitRec0
    The list is of type fitnessrecord; hash is a public array of shorts.

    -thanks
    jason

  4. #4
    Join Date
    Apr 1999
    Posts
    27,449

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Quote Originally Posted by evilxsystems View Post
    Code:
    bool sortFitRec0(const fitnessrecord& low, const fitnessrecord& high){
      return low.hash[0] > high.hash[0];
    }//end sortFitRec0
    The list is of type fitnessrecord; hash is a public array of shorts.

    -thanks
    jason
    So you're always comparing the first item in the array? Is that correct?

    Second, please mention the compiler and version you are using.

    Third, since sorting requires copying elements, what does your fitnessrecord class look like? Is it safely copyable?

    Regards,

    Paul McKenzie

  5. #5
    Join Date
    Jul 2009
    Posts
    9

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Yes, for this sort I'm just comparing the first item in the hash array.
    I'm using the Eclipse Galileo release as an IDE, which I believe calls the GCC built into the Mac dev tools: i686-apple-darwin9-gcc-4.0.1

    thanks
    -jason

  6. #6
    Join Date
    Jul 2009
    Posts
    9

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Here's the class def:
    Code:
    class fitnessrecord {
    public:
    	long index;
    	short hash[MAX_HASH];
    	short numhash;
    	short lane;
    	long copynum[MAX_LANES] ;
    
    	short maxima;
    	short sortlane;
    	string sequence;
    	string qual;
    
    	string printme(void){
    		stringstream iss(stringstream::in | stringstream::out);
    		iss >> std::noskipws;
    		iss << index << "\t" << lane << "\t" << sequence << "\t" << qual << "\t";
    		for (int i=0; i < numhash; i++) iss << hash[i] << "\t";
    		for (int i=0; i < MAX_LANES; i++)iss << copynum[i] << "\t";
    		iss << maxima <<"\n";
    		return iss.str();
    
    	}//end printme
    
    	long getTotalObs(){
    		long tot=0;
    		for (int i=0; i < MAX_LANES; i++)tot+=copynum[i];
    		return tot;
    	}
    
    };//end fitnessrecord

  7. #7
    Join Date
    Apr 1999
    Posts
    27,449

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Quote Originally Posted by evilxsystems View Post
    yes for this sort I'm just comparing the first item in the hash array...
    I'm using the Eclipse Galileo release as an IDE which I believe calls the GCC built into the Mac dev tools: i686-apple-darwin9-gcc-4.0.1

    thanks
    -jason
    OK, the last thing now is to see what your fitnessrecord class looks like, and if it has a user-defined copy ctor or assignment operator, what it does.

    There is no need to post the implementation of the functions in your class, just the declaration, and, if you have written one, the copy ctor/assignment operator for the class.

    Regards,

    Paul McKenzie

  8. #8
    Join Date
    Jan 2009
    Posts
    1,689

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Try letting it go overnight. It could be that OSX is screwing up its swapping for some reason and slowing down. UNIX uses lazy page swapping. 99% of the time it's faster than Windows' aggressive page swapping; it's just that 1% of the time things go screwy and it slows down.

    Check your activity monitor to see what it's doing. If it's still working you should see Page-ins and Page-outs incrementing.

  9. #9
    Join Date
    Nov 2006
    Posts
    1,611

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    I agree with ninja9578's theory; I think this is a VM issue, assuming I understand your "block" comment to mean that your application just doesn't seem to finish.

    I conducted an experiment, using VS2008 in XP x64 on a system with 8Gbyte RAM (it was busy at the time, too - had a VM running on a task while I tested).

    I had to open my VM limits to do this.

    I tried first using a list container, and your data construction.

    Slow, yes, but it ran at a volume of 20 million sample records.

    I tried using vector (reserved ahead of time) - memory usage drops some, sort is surprisingly no faster (a little slower), but similar result.

    Next, I tried a list<fitnessrecord *> - so, as I mentioned earlier, the swap was less burdened. This does improve timing some.

    Then, I split the data into a pair, something like

    pair< fitnesshash, fitnessrecord *>

    This would work with a class, a pair, however you like - the point is to separate the hash from the data, holding the data as a pointer.

    This had the best performance.

    The reason can be seen in the following behavior.

    In the list<fitnessrecord> or vector<fitnessrecord> strategy, the entire time the example ran, CPU usage was nearly zero while VM was the dominant bottleneck (and the disk was always busy). In fact, SQL was faster at sorting this volume than the C++ test.

    With the list< pair<fitnesshash, fitnessrecord *> > strategy, the CPU usage was nearly zero after the data was built, until the sort algorithm's pass through it (loading it back in from VM) - essentially leaving all the other data on disk (VM). At that point sorting consumed 100% CPU (of the core paying attention to the sort), disk action nearly stopped, and the sort finished as quickly as smaller runs do; then it took its time deleting the data (it has to load from VM to call destructors, etc.).

    I was able to complete a test on 50 million samples in a few minutes from start to sorted. The shutdown of the application took longer because of the VM involvement there, but at that point the work had long been finished.

    Worth a thought.
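    To make the "hash + pointer" idea concrete, here is a minimal sketch (HashRef, byFirstHash and sortViaIndex are illustrative names, not from the original code; it assumes the fitnessrecord definition posted above is in scope):
    Code:
    #include <list>
    #include <utility>

    // Sort a lightweight (key, pointer) index instead of the 382-byte records.
    // The fitnessrecord objects stay where they are; only the small pairs are
    // walked and compared, so far less data has to be paged in per comparison.
    typedef std::pair<short, fitnessrecord*> HashRef;    // hash[0] plus a pointer

    bool byFirstHash(const HashRef& a, const HashRef& b)
    {
        return a.first > b.first;                        // same ordering as sortFitRec0
    }

    void sortViaIndex(std::list<fitnessrecord>& records)
    {
        std::list<HashRef> index;
        for (std::list<fitnessrecord>::iterator it = records.begin();
             it != records.end(); ++it)
            index.push_back(std::make_pair(it->hash[0], &*it));

        index.sort(byFirstHash);   // comparisons touch only the small pairs

        // 'index' now yields the records in sorted order; iterate it to emit
        // the data or rebuild the original list as needed.
    }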


    Oh, another strategy may also be of interest: merge sorting. The idea is to sort lists of manageable size (say 1 million), write each to disk, sort the next, then "merge" the sorted collections. The merge is essentially "point to the first entry in each sorted collection, move the 'least' of the topmost into the output, and repeat". It's very simple.
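    If it helps, here is roughly what that merge step looks like in code. This is only a sketch under assumptions: mergeRuns, readRecord and writeRecord are made-up names, and the read/write helpers are placeholders for whatever serialization the record storage files actually use (they are not defined here); it reuses the sortFitRec0 comparator from above.
    Code:
    #include <fstream>
    #include <vector>

    // Hypothetical helpers: placeholders for whatever (de)serialization the
    // record storage files already use.
    bool readRecord(std::ifstream& in, fitnessrecord& rec);
    void writeRecord(std::ofstream& out, const fitnessrecord& rec);

    // Merge k already-sorted run files into one sorted output, holding only
    // one record per run in memory at a time.
    void mergeRuns(std::vector<std::ifstream*>& runs, std::ofstream& out)
    {
        std::vector<fitnessrecord> front(runs.size());
        std::vector<bool> alive(runs.size(), false);

        for (size_t i = 0; i < runs.size(); ++i)
            alive[i] = readRecord(*runs[i], front[i]);      // prime each run

        for (;;)
        {
            int best = -1;
            for (size_t i = 0; i < runs.size(); ++i)        // find the "topmost" record
                if (alive[i] && (best < 0 || sortFitRec0(front[i], front[best])))
                    best = static_cast<int>(i);
            if (best < 0)
                break;                                      // every run is exhausted

            writeRecord(out, front[best]);                  // move it to the output
            alive[best] = readRecord(*runs[best], front[best]);
        }
    }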

    Frankly, SQL is a good idea, too.

    Either that, or lots of RAM

    I'm about 85% sure that's the problem you're having. The time consumption is geometric with volume when VM is involved. It becomes ridiculous (as in days) at some point.
    Last edited by JVene; July 16th, 2009 at 10:33 PM.

  10. #10
    Join Date
    Jul 2009
    Posts
    9

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Thanks for all your replies, guys! I think you're probably correct about the VM. I had thought it wasn't an issue since the 12GB list that sorted fine was already using a large chunk of VM. The 20GB list did "seem" to block, but I didn't watch the activity monitor that carefully. I did try letting it go overnight, but in the morning the machine was locked and I had to hard reset. I was able to get a 17GB list to sort this afternoon, although at much slower speed (it took about 45 minutes), so it does seem like I need to be smarter about the sort or buy some more RAM. Since I really only need to sort the hashes, I'll just sort a subclass that contains only the hash and file indexes and rebuild the fitnessrecord list from the storage files after the sort (roughly as sketched below).
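    Something like this slim record is what I have in mind (the field names here are just placeholders):
    Code:
    // Carry only what the sort compares, plus enough information to find the
    // full record in its storage file afterwards.
    struct fitnesskey {
        short hash0;     // hash[0], the only field sortFitRec0 looks at
        short lane;      // which storage file the record came from
        long  offset;    // record index / byte offset within that file
    };

    bool byHash0(const fitnesskey& a, const fitnesskey& b)
    {
        return a.hash0 > b.hash0;   // same ordering as sortFitRec0
    }

    // After sorting a std::list<fitnesskey> with keys.sort(byHash0), walk the
    // keys in order and re-read each full fitnessrecord from (lane, offset)
    // to rebuild the sorted output.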

    But why does sort() need to copy the entire class though? It's a doubly linked list, doesn't sort just swap the pointers?

    thanks again!
    jason

  11. #11
    Join Date
    Nov 2006
    Posts
    1,611

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Locked? Sounds like ninja9578 has a point, 99% great, but then.....

    VM is like black magic that can backfire. It shouldn't really lock, but the OS can get twisted in knots because there's more going on than a single application, even in a quiet machine. During heavy paging demand, the OS itself can get stuck for a long time (it should 'come back', but it can take inhuman patience). That's highly suggestive that it's more trouble than it's worth to depend on it for what is over double your actual RAM. It's better at paging away whole application spaces that can rest, (relatively) dormant, until the pressure is off; but for running a sort within paged space, it's awful.

    One reason: the pages aren't even close to the size and boundary of the targets. Pages are typically 4K (though I assume this is OS/platform dependent). If you're bouncing around RAM picking up a few hundred bytes at a time, this means loading/dumping 4K blocks just to check a tiny hash value in objects scattered all around. A large problem is made even worse.
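    To put rough numbers on it: each fitnessrecord is 382 bytes and the comparison only reads hash[0] (a 2-byte short), so comparing two records that aren't resident means paging in at least two 4K pages - 8K or more of disk traffic to look at 4 bytes of useful data, a read amplification on the order of a couple of thousand to one.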

    Quote Originally Posted by evilxsystems View Post
    I'll just sort a subclass that contains only the hash and file indexes and rebuild the fitnessrecord list from the storage files after the sort...
    Strategies like this are going to yield huge dividends. SQL, depending on the engine and platform, could 'sort' a database like this (40 million or so) in under 2 hours on typical hardware (say a dual-core 2.5GHz with a few GB of RAM) - some, of course, can handle it in 1/3 that time or faster.


    Quote Originally Posted by evilxsystems View Post
    But why does sort() need to copy the entire class though? It's a doubly linked list, doesn't sort just swap the pointers?
    Well, that's the rub about STL containers. The link in that list is designed something like:

    Code:
    template<typename T> struct link
    {
     link<T> *next, *prev;
    
     T Data;
    };
    That is, obviously, a loose paraphrase, but the point is that the data is incorporated into the link. Now, yes, you'd think it could just manipulate links, but it's pushing new links into a destination, performing splits and merges - I haven't researched how my version does it exactly (I can think of at least 3 approaches I'd test for performance at the moment), but each version of the STL is free to choose whatever method the author likes.

    Obviously the memory overhead is kept small, so there isn't a full destination copy created while the original stays in RAM, and it may not be every element that's copied - only those that are targets in "split" or "merge" steps or some such algorithm.

    However, it's not just the copy, it's the bouncing around. If the class is 382 bytes, and there's 40 million on hand, that's going to put around (or really over) half of the data in VM. As the list must bounce all over the source for comparisons, VM is going to read in the whole object just to check the hash (and, whatever else is in the 4K page with it - plus, if the object is on a boundary 'between' pages, that's 8K it reads to check the hash).
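    (Quick arithmetic: 40 million records at 382 bytes each is roughly 15 GB of record data alone, before any per-node list overhead, so an 8 Gbyte machine can hold at most about half of it in RAM at once.)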

    Of course, this works out over average scenarios. There can be times when it runs through a few thousand quickly because they happen to be in RAM, then it runs through a few hundred that aren't (now paging) at less than 1/1000th the speed. The larger the overrun, the worse this gets, to the point that execution time explodes beyond mere geometric orders - it's got to reach quantum physics/galactic levels of impracticality.

    So, even if the container isn't copying every object, the VM is copying the object to/from RAM at every swap, in or out. Speed drops from, what is RAM these days, over 2Gbytes per second down to our best average of about 100 Mbytes per second, but wait - there's more! 4K (or 8K) is hardly what's on a cylinder, so the page is reading a tiny portion of the disc's spin capacity (the amount that could be read on a single rotation). Put in seek time to find the right track, the latency for the target data to sweep under the head, and add that to every page operation - read/write speed must be dropping to below the 10Mbyte (maybe below 1Mbyte) per second speed.

    If you had data in larger blocks, and could process in 'chunks' that fit inside RAM, it's much more practicable - since you're essentially keeping the work inside a RAM-sized space, paging isn't going to occur (potentially) at every object comparison, only between 'groups'. I say this because at some point even a 'pair' construction (limiting the sort to just the hash + pointer) has a limit on an 8 Gbyte machine, somewhere in the few hundred million range. If your task gets larger, it's an algorithm shift that's going to make sense instead of a storage shift.

    Choices like this are all about compromises for certain target limits. You're on the right track with the "hash-pointer pair" concept for the limits discussed here, so I don't think you have to think hard about that this time. If your RAM limit were much smaller, or your target size grows, then you will have to think about partitioning the work into groups that fit into RAM, instead of partitioning the storage of the entire job. The old 3-tape merge sort comes to mind as an example.

  12. #12
    Lindley
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Quote Originally Posted by JVene View Post
    Oh, another strategy may also be of interest, merge sorting. The idea is to sort lists of manageable size (say 1 million), write that to disk, sort the next, then "merge" sort the collection. The merge sort is essentially "point to the first entry in each sorted collection, move the "least" of the topmost into the output, and repeat". It's very simple.
    MergeSort definitely has better spatial locality than QuickSort, which will certainly help in a VM situation, and both have the same asymptotic bound.

  13. #13
    Join Date
    Jul 2009
    Posts
    9

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Well, I just ordered another 16GB of RAM, but I'm also going to strip down the class for the sort... I can't expect someone to have 24GB of RAM, unless of course I start working for the Windows 8 dev team.

    thanks all!

    -jason

  14. #14
    Join Date
    Jul 2009
    Posts
    9

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Also, the total data size is fairly limited by current DNA sequencing technology, so I expect I won't need to scale up by more than a factor of 3 or so in the next few years. If I can stably sort something on the order of 100 million records, it'll probably be fine.
