-
May 30th, 2012, 01:06 AM
#1
Sort Huge files
I am trying to understand what techniques can be used to sort really huge files (files larger than available memory). I did some googling and came across one technique.
1. Are there any better ways to get this done?
2. Is there some tweaking that can be done to make this itself better?
The article recommends choosing a chunk size "large enough so that you get a lot of records, but small enough such that it will comfortably fit into memory".
3. How do you decide on this value? Consider: memory is 4 GB, about 2 GB is currently consumed, and the file to sort is 10 GB. (Consumed memory could of course change dynamically during execution, as other apps use more or less.)
-
May 30th, 2012, 04:18 AM
#2
Re: Sort Huge files
Originally Posted by ShaQ.Blogs
I am trying to understand what techniques can be used to sort really huge files (larger than available memory).
The technique has been known for decades. It's called an external merge sort:
http://en.wikipedia.org/wiki/External_sorting
That site you found describes nothing more than what has been known and used for the past 40+ years.
Regards,
Paul McKenzie
Last edited by Paul McKenzie; May 30th, 2012 at 04:25 AM.
-
May 30th, 2012, 08:24 AM
#3
Re: Sort Huge files
Sounds like it might be time to consider a database.
-
May 30th, 2012, 09:38 AM
#4
Re: Sort Huge files
I'd say the big question is: "Do you HAVE to use a comparison based sort?".
Depending on the elements being sorted, I'm pretty sure a bucket/radix sort can beat the crap out of a comparison sort on input of any size.
Is your question related to IO?
Read the C++ FAQ article at parashift by Marshall Cline, in particular points 1-6.
It explains how to correctly deal with IO, how to validate input, and why you shouldn't count on "while(!in.eof())". And it always makes for excellent reading.
-
May 30th, 2012, 10:03 AM
#5
Re: Sort Huge files
I agree that a radix sort would likely be the best bet here. You do not need to sort everything in RAM; you can push large amounts of data off to the hard drive. A radix sort is useful here because you can put the buckets on the hard drive, then load each small bucket into RAM, quicksort it, and simply concatenate the results once every bucket has been processed.
-
May 31st, 2012, 04:15 AM
#6
Re: Sort Huge files
Originally Posted by Paul McKenzie
Thanks for the link & info
I'd say the big question is: "Do you HAVE to use a comparison based sort?".
Depending on the elements being sorted, I'm pretty sure a Bucket/Radix sort can beat the crap out of any sized input.
I agree that a radix sort would likely be the best bet here. You do not need to sort things in RAM, you can push large amounts of data off to the hard drive. A Radix sort would be useful here because you can put the buckets on the hard drive, then load small buckets into RAM and do a quicksort on those buckets, then just combine them when you are done iterating each bucket.
I have some reading to do now...