  1. #1
    Join Date
    Jul 2009
    Posts
    9

    Question std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Hi-

    I'm experiencing a weird problem that makes me wonder if there is a problem with the std::list::sort member function when used on very large lists. I have a very large list (~20 GB) made up of a class (382 bytes each) of DNA sequence data. A list on the order of 12 GB sorts fine, but once it gets up into the 17 GB range, the call to sort() seems to block. The list has already been built, so I don't think it's a malloc issue. I also don't think it's a complexity problem, because the 12 GB list sorts fine (it takes a minute or so). So I'm wondering: does the std sort method not properly handle the extra memory available in 64-bit mode? Has anyone experienced this, or does anyone know a good place to ask this question?

    -thanks
    jason

  2. #2
    Join Date
    Apr 1999
    Posts
    27,449

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Quote Originally Posted by evilxsystems View Post
    Hi-

    I'm experiencing a weird problem that makes me wonder if there is a problem with the std::list::sort member function when used on very large lists. I have a very large list (~20 GB) made up of a class (382 bytes each) of DNA sequence data. A list on the order of 12 GB sorts fine, but once it gets up into the 17 GB range, the call to sort() seems to block.
    What is your sorting criterion? What does it look like (in code)?

    Problems with sorts can stem from writing incorrect sorting predicates. If you wrote one, please post it.
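    For reference, the predicate must implement a strict weak ordering, or list::sort has undefined behavior (hangs and crashes included). A minimal sketch of the difference, using a hypothetical Record type:
    Code:
    struct Record {
        int key;
    };

    // Correct: strict comparison, so compare(x, x) is always false.
    bool lessByKey(const Record& a, const Record& b) {
        return a.key < b.key;
    }

    // Broken: <= makes compare(x, x) return true, which violates strict
    // weak ordering; sorting with a predicate like this can hang or crash.
    bool badLessByKey(const Record& a, const Record& b) {
        return a.key <= b.key;
    }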

    Regards,

    Paul McKenzie

  3. #3
    Join Date
    Jul 2009
    Posts
    9

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Code:
    bool sortFitRec0(const fitnessrecord& low, const fitnessrecord& high) {
        return low.hash[0] > high.hash[0];   // descending by first hash value
    }//end sortFitRec0
    The list holds fitnessrecord objects; hash is a public array of shorts.

    -thanks
    jason

  4. #4
    Join Date
    Apr 1999
    Posts
    27,449

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Quote Originally Posted by evilxsystems View Post
    Code:
    bool sortFitRec0(const fitnessrecord& low, const fitnessrecord& high) {
        return low.hash[0] > high.hash[0];   // descending by first hash value
    }//end sortFitRec0
    The list holds fitnessrecord objects; hash is a public array of shorts.

    -thanks
    jason
    So you're always comparing the first item in the array? Is that correct?

    Second, please mention the compiler and version you are using.

    Third, since sorting requires copying elements, what does your fitnessrecord class look like? Is it safely copyable?

    Regards,

    Paul McKenzie

  5. #5
    Join Date
    Jul 2009
    Posts
    9

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Yes, for this sort I'm just comparing the first item in the hash array.
    I'm using the Eclipse Galileo release as an IDE, which I believe calls the GCC built into the Mac dev tools: i686-apple-darwin9-gcc-4.0.1

    thanks
    -jason

  6. #6
    Join Date
    Jul 2009
    Posts
    9

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Here's the class def:
    Code:
    class fitnessrecord {
    public:
        long index;
        short hash[MAX_HASH];
        short numhash;
        short lane;
        long copynum[MAX_LANES];

        short maxima;
        short sortlane;
        string sequence;
        string qual;

        string printme(void) {
            stringstream iss(stringstream::in | stringstream::out);
            iss >> std::noskipws;
            iss << index << "\t" << lane << "\t" << sequence << "\t" << qual << "\t";
            for (int i = 0; i < numhash; i++) iss << hash[i] << "\t";
            for (int i = 0; i < MAX_LANES; i++) iss << copynum[i] << "\t";
            iss << maxima << "\n";
            return iss.str();
        }//end printme

        long getTotalObs() {
            long tot = 0;
            for (int i = 0; i < MAX_LANES; i++) tot += copynum[i];
            return tot;
        }
    };//end fitnessrecord

  7. #7
    Join Date
    Apr 1999
    Posts
    27,449

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Quote Originally Posted by evilxsystems View Post
    Yes, for this sort I'm just comparing the first item in the hash array.
    I'm using the Eclipse Galileo release as an IDE, which I believe calls the GCC built into the Mac dev tools: i686-apple-darwin9-gcc-4.0.1

    thanks
    -jason
    OK, the last thing now is to see what your fitnessrecord class looks like, and if it has a user-defined copy ctor or assignment operator, what it does.

    There is no need to post the implementations of the functions in your class, just the declaration and, if you have written one, the copy ctor/assignment operator for the class.

    Regards,

    Paul McKenzie

  8. #8
    Join Date
    Apr 1999
    Posts
    27,449

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Quote Originally Posted by evilxsystems View Post
    here's the class def..
    OK, so it's safely copyable since you have just arrays, integers, and string types.

    What is the value of MAX_HASH, and how many fitnessrecords are in your container?

    Regards,

    Paul McKenzie

  9. #9
    Join Date
    Jul 2009
    Posts
    9

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    MAX_HASH is currently set to 25, but I'm only actually filling the first 4 entries at the moment. Could it be an issue if I'm not properly initializing all of the hash array elements? The number of fitnessrecords is approximately 40 million.

    -thanks again
    jason

  10. #10
    Join Date
    Nov 2006
    Posts
    1,611

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    And out of curiosity (relative to the problem you're having), how much RAM is in your machine?

  11. #11
    Join Date
    Jul 2009
    Posts
    9

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    8 GB. Obviously that's not enough for this, so it's going into virtual memory; after the list is built, the monitor shows I'm using about 19 GB.

  12. #12
    Join Date
    Nov 2006
    Posts
    1,611

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    I thought so....

    I've not fully researched your posts yet. I'll join in here too, but I suspect Paul is busy with that, and I'll wait on his return (my evening is about to get busy).

    Generally the STL is reliable even at high volumes like this. Paul's already mentioned some of the likely things to investigate, but your problem may be little more than VM issues.

    For example, is the problem simply that it seems to stop?

    As the size gets larger, the performance of VM will degrade badly. A little VM dependence might add 10% to the time, a little more adds 50%, a little more than that and you're suddenly seeing triple or quadruple the runtime. After that (which you're reporting is about the limit here), it turns from hours to weeks to years - literally.

    I assume the algorithm you're running doesn't suit databases, and you intend to execute this on a machine with suitable RAM at some point?

    Why list as the container?

    Perhaps a vector of pointers?

    The reason I ask...

    For about 40 million elements, the vector of pointers would be under 1 GB (320 MB). The data itself would still occupy room, but when using containers, if your sort must swap or copy entries, the copy involves the entire object, which involves VM.

    When the container is a set of pointers to your object, your storage method is more complicated, but the swap or copy during sort only involves the pointers, not the full objects. This 'lightens' the load on VM overall.

    It might be helpful here.
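    A minimal sketch of the idea, reusing your sortFitRec0 criterion (it assumes your existing container is a std::list<fitnessrecord> named records; the names here are just for illustration):
    Code:
    #include <algorithm>
    #include <list>
    #include <vector>

    // Same criterion as sortFitRec0, but through pointers: the sort then
    // swaps 8-byte pointers instead of full 382-byte records.
    bool sortFitRec0Ptr(const fitnessrecord* low, const fitnessrecord* high) {
        return low->hash[0] > high->hash[0];
    }

    // Inside your sorting routine. std::list never relocates its nodes,
    // so these pointers stay valid as long as the list is not modified.
    std::vector<fitnessrecord*> ptrs;
    ptrs.reserve(records.size());
    for (std::list<fitnessrecord>::iterator it = records.begin();
         it != records.end(); ++it)
        ptrs.push_back(&*it);

    std::sort(ptrs.begin(), ptrs.end(), sortFitRec0Ptr);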

  13. #13
    Join Date
    Aug 2000
    Location
    West Virginia
    Posts
    7,721

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    I don't have access to the g++ source code for list right now, but Visual C++ had a similar problem with sorting a very large number of elements. See the link below; maybe there is a similar array that needs its size increased in g++. (The first two changes had to do with a coding error; the last dealt with performance when sorting large lists.)

    In the following link, look for "fix to <list>":

    http://www.dinkumware.com/vc_fixes.html

  14. #14
    Join Date
    Jan 2009
    Posts
    1,689

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    Try letting it go overnight. It could be that OS X is screwing up its swapping for some reason and slowing down. UNIX uses lazy page swapping; 99% of the time it's faster than Windows' aggressive page swapping, but that 1% of the time things go screwy and it's slow.

    Check Activity Monitor to see what it's doing. If it's still working, you should see Page-ins and Page-outs incrementing.

  15. #15
    Join Date
    Nov 2006
    Posts
    1,611

    Re: std::list.sort() fails on very large lists (>12GB) GCC Mac OS X 64-bit

    I agree with ninja9578's theory: I think this is a VM issue, assuming I understand your "block" comment to mean that your application just doesn't seem to finish.

    I conducted an experiment using VS2008 on XP x64 on a system with 8 GB of RAM (it was busy at the time, too - I had a virtual machine running a task while I tested).

    I had to open my VM limits to do this.

    I first tried a list container with your data structures.

    Slow, yes, but it ran with a volume of 20 million sample records.

    I tried using a vector (reserved ahead of time): memory usage drops some, and the sort is surprisingly no faster (a little slower), with a similar result.

    Next, I tried a list<fitnessrecord *> - so, as I mentioned earlier, the swap was less burdened. This does improve timing some.

    Then, I split the data into a pair, something like

    pair< fitnesshash, fitnessrecord *>

    This would work with a class, a pair, however you like - the point is to separate the hash from the data, holding the data as a pointer.

    This had the best performance.

    The reason can be seen in the following behavior.

    In the list<fitnessrecord> or vector<fitnessrecord> strategy, the entire time the example ran, CPU usage was nearly zero while VM was the dominant bottleneck (and the disk was always busy). In fact, SQL was faster at sorting this volume than the C++ test.

    With the list< pair<fitnesshash, fitnessrecord *> > strategy, CPU usage was nearly zero after the data was built, until the sort algorithm passed through it (loading it back in from VM), essentially leaving all the other data on disk (VM). At that point sorting consumed 100% CPU (of the core paying attention to the sort), disk action nearly stopped, and the sort finished as quickly as smaller runs. Then it took its time deleting the data (it has to load from VM to call destructors, etc.).

    I was able to complete a test on 50 million samples in a few minutes from start to sorted. The shutdown of the application took longer because of the VM involvement there, but at that point the work had long been finished.
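    In code, the keyed-index step looked roughly like this (a minimal sketch; it again assumes a std::list<fitnessrecord> named records, and uses the first hash value as the key):
    Code:
    #include <algorithm>
    #include <list>
    #include <utility>
    #include <vector>

    typedef std::pair<short, fitnessrecord*> keyed;   // (sort key, record)

    // Compares only the cached key; the 382-byte record is never touched
    // during the sort, so it can stay paged out on disk.
    bool byKeyDesc(const keyed& a, const keyed& b) {
        return a.first > b.first;
    }

    // Build the compact index, then sort just the index.
    std::vector<keyed> index;
    index.reserve(records.size());
    for (std::list<fitnessrecord>::iterator it = records.begin();
         it != records.end(); ++it)
        index.push_back(keyed(it->hash[0], &*it));

    std::sort(index.begin(), index.end(), byKeyDesc);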

    Worth a thought.


    Oh, another strategy may also be of interest: merge sorting. The idea is to sort lists of manageable size (say 1 million entries), write each sorted run to disk, sort the next, then "merge" the sorted runs. The merge is essentially "point to the first entry in each sorted collection, move the least of the topmost into the output, and repeat". It's very simple.
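    A sketch of the merge step (over in-memory runs for brevity; in practice each run would stream from its own sorted file, and the element type would be your record key rather than a bare short):
    Code:
    #include <cstddef>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    typedef std::pair<short, std::size_t> headOf;   // (front value, run index)

    // Merge k pre-sorted runs into 'out'. The min-heap holds one candidate
    // per run; pop the least, then refill from the run it came from.
    void kWayMerge(const std::vector< std::vector<short> >& runs,
                   std::vector<short>& out) {
        std::vector<std::size_t> pos(runs.size(), 0);
        std::priority_queue<headOf, std::vector<headOf>,
                            std::greater<headOf> > minHeap;
        for (std::size_t r = 0; r < runs.size(); ++r)
            if (!runs[r].empty())
                minHeap.push(headOf(runs[r][0], r));
        while (!minHeap.empty()) {
            headOf h = minHeap.top();
            minHeap.pop();
            out.push_back(h.first);
            std::size_t r = h.second;
            if (++pos[r] < runs[r].size())
                minHeap.push(headOf(runs[r][pos[r]], r));
        }
    }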

    Frankly, SQL is a good idea, too.

    Either that, or lots of RAM.

    I'm about 85% sure that's the problem you're having. The time consumption is geometric with volume when VM is involved. It becomes ridiculous (as in days) at some point.
    Last edited by JVene; July 16th, 2009 at 10:33 PM.
