-
July 16th, 2009, 06:00 PM
#1
std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
Hi-
I'm experiencing a weird problem that makes me wonder if there is a problem with the std::list::sort member function when used on very large lists. I have a very large list (~20 GB) made up of a class (382 bytes each) of DNA sequence data. When the list is on the order of 12 GB it sorts fine, but when it gets up into the 17 GB range, calling sort() seems to block. The list has already been built, so I don't think it's a malloc issue. I also don't think it's a complexity problem, because the 12 GB list sorts fine (it takes a minute or so). So I'm wondering: does the std sort method not properly handle the extra memory available in 64-bit mode? Has anyone experienced this, or know a good place to ask this question?
-thanks
jason
-
July 16th, 2009, 06:11 PM
#2
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
Originally Posted by evilxsystems
Hi-
I'm experiencing a weird problem that makes me wonder if there is a problem with the std::list::sort member function when used on very large lists. I have a very large list (~20 GB) made up of a class (382 bytes each) of DNA sequence data. When the list is on the order of 12 GB it sorts fine, but when it gets up into the 17 GB range, calling sort() seems to block.
What is your sorting criteria? What does it look like (in code)?
Problems with sorts can stem from writing incorrect sorting predicates. If you wrote one, please post it.
Regards,
Paul McKenzie
-
July 16th, 2009, 06:18 PM
#3
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
Code:
bool sortFitRec0(const fitnessrecord& low, const fitnessrecord& high){
    return low.hash[0] > high.hash[0];
}//end sortFitRec0
The list is of type fitnessrecord; hash is a public array of shorts.
-thanks
jason
-
July 16th, 2009, 06:25 PM
#4
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
Originally Posted by evilxsystems
Code:
bool sortFitRec0(const fitnessrecord& low, const fitnessrecord& high){
    return low.hash[0] > high.hash[0];
}//end sortFitRec0
The list is of type fitnessrecord; hash is a public array of shorts.
-thanks
jason
So you're always comparing the first item in the array? Is that correct?
Secondly, please mention the compiler and version that you are using.
Third, since sorting requires copying elements, what does your fitnessrecord class look like? Is it safely copyable?
Regards,
Paul McKenzie
-
July 16th, 2009, 06:31 PM
#5
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
yes for this sort I'm just comparing the first item in the hash array...
I'm using the Eclipse Galileo release as an IDE which I believe calls the GCC built into the Mac dev tools: i686-apple-darwin9-gcc-4.0.1
thanks
-jason
-
July 16th, 2009, 06:35 PM
#6
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
here's the class def..
Code:
class fitnessrecord {
public:
    long index;
    short hash[MAX_HASH];
    short numhash;
    short lane;
    long copynum[MAX_LANES];
    short maxima;
    short sortlane;
    string sequence;
    string qual;

    string printme(void) {
        stringstream iss(stringstream::in | stringstream::out);
        iss >> std::noskipws;
        iss << index << "\t" << lane << "\t" << sequence << "\t" << qual << "\t";
        for (int i = 0; i < numhash; i++) iss << hash[i] << "\t";
        for (int i = 0; i < MAX_LANES; i++) iss << copynum[i] << "\t";
        iss << maxima << "\n";
        return iss.str();
    }//end printme

    long getTotalObs() {
        long tot = 0;
        for (int i = 0; i < MAX_LANES; i++) tot += copynum[i];
        return tot;
    }
};//end fitnessrecord
-
July 16th, 2009, 06:36 PM
#7
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
Originally Posted by evilxsystems
yes for this sort I'm just comparing the first item in the hash array...
I'm using the Eclipse Galileo release as an IDE which I believe calls the GCC built into the Mac dev tools: i686-apple-darwin9-gcc-4.0.1
thanks
-jason
OK, the last thing now is to see what your fitnessrecord class looks like, and, if it has a user-defined copy ctor or assignment operator, what it does.
There is no need to post the implementations of the functions in your class, just the declaration, plus the copy ctor/assignment operator if you have written one.
Regards,
Paul McKenzie
-
July 16th, 2009, 06:39 PM
#8
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
Originally Posted by evilxsystems
here's the class def..
OK, so it's safely copyable since you have just arrays, integers, and string types.
What is the value of MAX_HASH, and how many fitnessrecords are in your container?
Regards,
Paul McKenzie
-
July 16th, 2009, 06:46 PM
#9
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
MAX_HASH is currently set to 25, but I'm only actually filling the first 4 elements at the moment. Could it be an issue if I'm not properly initializing all the hash array elements? The number of fitnessrecords is approximately 40 million.
-thanks again
jason
-
July 16th, 2009, 06:52 PM
#10
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
And out of curiosity (relative to the problem you're having), how much RAM is in your machine?
If my post was interesting or helpful, perhaps you would consider clicking the 'rate this post' to let me know (middle icon of the group in the upper right of the post).
-
July 16th, 2009, 06:54 PM
#11
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
8 GB... obviously that's not enough for this, so it's getting into virtual memory; after the list is built, it shows I'm using about 19 GB.
-
July 16th, 2009, 07:02 PM
#12
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
I thought so....
I've not fully researched your posts yet. I'll join in here too, though I suspect Paul is already on this, and my evening is about to get busy.
Generally the STL is reliable even at high volumes like this. Paul has already mentioned some of the likely things to investigate, but your problem may be little more than VM issues.
For example, is the problem simply that it seems to stop?
As the size gets larger, the performance of VM degrades badly. A little VM dependence might add 10% to the runtime, a little more adds 50%, a little more than that and you're suddenly seeing triple or quadruple runtime. Beyond that (which is about where you're reporting the limit here), it turns from hours to weeks to years, literally.
I assume the algorithm you're running doesn't suit databases, and you intend to execute this on a machine with suitable RAM at some point?
Why list as the container?
Perhaps a vector of pointers?
The reason I ask...
For about 40 million elements, a vector of pointers would be well under 1 GByte (about 320 MBytes). The data itself would still occupy room, but when the container holds full objects, any swap or copy during the sort involves the entire object, which involves VM.
When the container holds pointers to your objects, the storage scheme is more complicated, but the swap or copy during the sort only involves the pointers, not the full objects. This 'lightens' the load on VM overall.
It might be helpful here.
-
July 16th, 2009, 07:04 PM
#13
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
I don't have access to the g++ source code for list right now, but Visual C++ had a similar problem with sorting a very large number of elements. See the link below; maybe there is a similar array that needs its size increased in g++. (The first two changes had to do with a coding error; the last dealt with performance in sorting large lists.)
In the following link, look for "fix to <list>":
http://www.dinkumware.com/vc_fixes.html
-
July 16th, 2009, 08:38 PM
#14
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
Try letting it go overnight. It could be that OS X is screwing up its swapping for some reason and slowing down. UNIX uses lazy page swapping; 99% of the time it's faster than Windows's aggressive page swapping, it's just that 1% of the time things go screwy and it slows down.
Check your Activity Monitor to see what it's doing. If it's still working, you should see page-ins and page-outs incrementing.
-
July 16th, 2009, 10:27 PM
#15
Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit
I agree with ninja9578's theory; I think this is a VM issue, assuming I understand your "block" comment as the observation that your application just doesn't seem to finish.
I conducted an experiment, using VS2008 in XP x64 on a system with 8Gbyte RAM (it was busy at the time, too - had a VM running on a task while I tested).
I had to open my VM limits to do this.
I tried first using a list container, and your data construction.
Slow, yes, but it ran at a volume of 20 million sample data.
I tried using vector (reserved ahead of time) - memory usage drops some, sort is surprisingly no faster (a little slower), but similar result.
Next, I tried a list<fitnessrecord *> - so, as I mentioned earlier, the swap was less burdened. This does improve timing some.
Then, I split the data into a pair, something like
pair< fitnesshash, fitnessrecord *>
This would work with a class, a pair, however you like - the point is to separate the hash from the data, holding the data as a pointer.
This had the best performance.
The reason is observed in this behavior.
In the list<fitnessrecord> or vector<fitnessrecord> strategy, the entire time the example ran, CPU usage was nearly zero while VM was the dominant bottleneck (and the disk was always busy). In fact, SQL was faster at sorting this volume than the C++ test.
With the list< pair<fitnesshash, fitnessrecord *> > strategy, CPU usage was nearly zero after the data was built, until the sort algorithm passed through it (loading the pairs back in from VM), essentially leaving all the other data on disk (VM). At that point sorting consumed 100% CPU (of the core paying attention to the sort), disk activity nearly stopped, and the sort finished as quickly as smaller runs would from that point; then it took its time deleting the data (it has to load from VM to call destructors, etc.).
I was able to complete a test on 50 million samples in a few minutes from start to sorted. The shutdown of the application took longer because of the VM involvement there, but at that point the work had long been finished.
Worth a thought.
Oh, another strategy may also be of interest: merge sorting. The idea is to sort lists of manageable size (say, 1 million), write each to disk, sort the next, and then "merge" the sorted collections. The merge step is essentially "point to the first entry in each sorted collection, move the least of the heads into the output, and repeat". It's very simple.
Frankly, SQL is a good idea, too.
Either that, or lots of RAM.
I'm about 85% sure that's the problem you're having. The time consumption grows geometrically with volume once VM is involved. It becomes ridiculous (as in days) at some point.
Last edited by JVene; July 16th, 2009 at 10:33 PM.