I am working with very large files and I am storing a lot of data in a vector. I would like to make this process of copying the data as fast as possible. Since I'm using the STL I don't have to worry about programming the routines to be as a fast and efficient as possible. But I do have a choice:
1. Use vector_name.reserve(length) and then a series of push_backs to fill in the data
--or--
2. Use assign(array_begin_address,array_end_address)--without reserve, since assign would change the point anyway.
I am lucky enough to have been furnished with a very fast computer and have tried both ways and can't tell the difference--time-wise. However, the users of the my software might not be so lucky. If anyone could answer this I'd be very grateful.
-DM
TheCPUWizard
July 9th, 2008, 04:03 PM
Remember the first rule of performance...
Performance is ONLY a priority when you have measured that an implementation has a significant impact on meeting the documented performance requirements. In all other cases, readability and maintainability over-ride performance.
That being said, look at the logic of your code. If you are "logically" appending to the array, then push_back(...) is the clearest explaination of your intent. The addition of the reserve(...) keeps this intent visible, but provides the performance boost.
On the ofther hand, assign(....) indicates that you are putting someting into the vector at a specific place. A new reader of your code would have to examine the logic of your parameters to finaly realzie that this was appending to the end of the vector. Additionally a bug could easily cause you to perform some other action than actually appending (consider if the vector length changes somehow between the point where you calculate the offsets and where the assign takes place....
Looked at that way, the choice (at least to me( is clear!
Philip Nicoletti
July 9th, 2008, 07:51 PM
I prefer assign. If you have Scott Meyer's "Effective STL" see
item 5 : "Prefer range member functions to their single element
counterparts". The very first example is an "assign versus push_back".
On the ofther hand, assign(....) indicates that you are putting someting into the vector at a specific place.
Actually, assign replaces the contents of the container.
So it is the equivalent of a : clear() , reserve() , multiple push_backs().
You were probably think of insert(). (which Meyer's also prefers).
active2volcano
July 9th, 2008, 08:25 PM
What is the size of your "very large file"?Since this file is very large,why read the file to the memory?
Sccrman06
July 9th, 2008, 10:22 PM
First, thanks for your responses.
My large files are hundreds of MB's to GB's in size. I'm working with CAD and Discrete Data files and these vectors that I'm creating are a know size. The point of creating these vectors is to repair the data or CAD representation of these files by closing gaps and repairing overlaps.
The implementation of the array's is fixed. I am reading the information in from a file or from a user via an API. I can't touch the original arrays because the original integrity of the data cannot be compromised. Because I can't touch the original array's I'm putting the data in a data structure in a convenient wrapper, stl::vector, so that I can work with the data.
The question was posed to find the fastest and cheapest way to copy the data into a vector by using either assign or reserve+push_back. A pointer to a vector is a member of the class "Mesh" and these pointers are pointed to "newed" vectors during initialization of an instance of the Mesh class. As I understand it, the number of operations is the same since the data is copied in both cases and the number of assignments is the number of elements in the array long.
Assign destroys the data the old vector holds and allocates memory for the new array based on the pointers passed to "assign". Then it copies the data.
Reserve first allocates the array and the push_back copies the array contents into the vector. What I was wondering is if the overhead from so many push_back's out weigh a single call to assign. I know the same thing happens using both. What I was wondering was which one was faster?
active2volcano
July 9th, 2008, 11:46 PM
If all your data had been stored in the array, at the same time, the data is needn't nodified, the assignment is a better choice!
Else, You will choose the reserve.
laserlight
July 9th, 2008, 11:50 PM
If you have Scott Meyer's "Effective STL" see
item 5 : "Prefer range member functions to their single element
counterparts". The very first example is an "assign versus push_back".
I do not have a copy of that book with me now... so what is Meyers' reasoning considering that reserve() can be used?
active2volcano
July 9th, 2008, 11:55 PM
According to my experience,the two methods didn't have evident effect!
You need decompound the very large data till the data can push_back to the vector under your anticipant efficiency.
laserlight
July 10th, 2008, 12:12 AM
Meyers' may have more than just measurable efficiency in mind when he recommends assign over push_back.
active2volcano
July 10th, 2008, 01:57 AM
Using the assignment can be more effective than inserting the element one by one !But that depend on all your data must be stored in the sequence container , like the vector,list,array...
At the same time,while you insert the element ,you don't want to modify some element!
active2volcano
July 10th, 2008, 02:01 AM
This is the real meaning of Mayers<<Effective STL>> Item 5!
Zaccheus
July 10th, 2008, 04:28 AM
Sccrman06, if you can't load the data straight into the vector, I would use assign because that function does exactly what you are trying to do (make a vector which contains the given data) in one go.
TheCPUWizard
July 10th, 2008, 05:57 AM
Meyers' may have more than just measurable efficiency in mind when he recommends assign over push_back.
EXACTLY! :thumb: :wave:
The question of which is faster should be considered only after measurements are taken which determine if the time is meaningful.
Also finding the time in the general case is very very difficult. It may vary dpending on version/implementation of STL (different results for different compilers). It may also depend on locality of data (impacting L1/L2 caching, etc). It may also depend on the amunt of data.
Since all of these variables exist, one can not come up with a definitive answer (If anyone is willing to state that one is faster than the other for all possible conditions and put money on the table, let me know!) the decision should revet back to what is cleaner code.
Sccrman06
July 10th, 2008, 08:48 AM
Thanks for all of the advice guys.
I think I'll be going with assign since it makes for cleaner code. Assign requires one statement while reserve+push_back requires two+for-loop. I still would like to know which is faster in general. By that I mean which requires fewer operations or whatever it means to be "faster."
TheCPUWizard
July 10th, 2008, 08:52 AM
. I still would like to know which is faster in general. By that I mean which requires fewer operations or whatever it means to be "faster."
Re-read reply #13..... It is impossible to make a 100% accurate prediciton of which approach will be faster. I can generate conditions in a program which will make either one faster than the other for a VERY specific use case.
Going by "cleaner code" (which is what you are doing) is the only rational approach.
exterminator
July 10th, 2008, 09:41 AM
Do some benchmarking with higher precision timers. If possible use virtual machines to test in different OS-es/configurations. Also, depending upon your application, you can also try different memory allocators. Or even different data structures than a vector if memory consumption is pretty high. But all that depends on what your application is. Also, try out std::deque and if it has an overall impact on your application's performance.
But with vector, for such higher memory consumption, reserve must always be there: either directly or indirectly otherwise due to memory fragmentation, your application might be denied of the memory asked for and the frequent allocate, copy and destroy can become very very expensive.
Also, might be a good idea to work with the data in chunks rather than making a complete copy of it in a vector, if possible.
There can be a variety of ways to make things better, there's no end to it. But better in what ways is something that should be measured. Performance if is very critical then a little unmaintainable code does not harm, but you should try to document it well and keep it as simple as possible.
kempofighter
July 10th, 2008, 10:14 AM
Thanks for all of the advice guys.
I think I'll be going with assign since it makes for cleaner code. Assign requires one statement while reserve+push_back requires two+for-loop. I still would like to know which is faster in general. By that I mean which requires fewer operations or whatever it means to be "faster."
The push_back method seems rather clumsy in my opinion and it seems like common sense that there will be additional overhead for making the function call over and over again. I'm not certain of how the assign method will allocate the memory. According to cplusplus.com the complexity of the vector::assign operation is:
Linear on initial and final sizes (destruction, copy construction).
In the case of the iteration version, if the iterators are forward, bidirectional or random-access, the new capacity is determined before beginning, otherwise logarithmic complexity (object reallocations) must be added on top of that.
Perhaps someone who is more knowledgeable than I could help clarify that quote.
Either way, you need to use a profiling tool and try different approaches to see what is best or build in debug hooks that will calculate the time it takes to do the work in the various cases. It seems reasonable that you could still reserve memory upfront rather than rely on lazy evaluation. That way you don't have to worry about the efficiency of the assign operation because there will be enough memory prior to calling vector::assign. As far as push back vs. assign goes, I don't see how the hundreds or thousands of push_back calls could ever be faster than a single call to assign. The assign function was defined for a reason and your particular problem seems to be the reason that this function exists.
Lindley
July 10th, 2008, 10:25 AM
The push_back method seems rather clumsy in my opinion and it seems like common sense that there will be additional overhead for making the function call over and over again. I'm not certain of how the assign method will allocate the memory. According to cplusplus.com the complexity of the vector::assign operation is:
Linear on initial and final sizes (destruction, copy construction).
In the case of the iteration version, if the iterators are forward, bidirectional or random-access, the new capacity is determined before beginning, otherwise logarithmic complexity (object reallocations) must be added on top of that.
Perhaps someone who is more knowledgeable than I could help clarify that quote.
Fairly simple to explain. I'll lay it out, and also why push_back is perfectly reasonable in most circumstances, and especially if you've made an appropriate reserve() call first.
Linear on initial and final sizes (destruction, copy construction).
The assign() function must destroy all existing elements that it's replacing, and must copy all elements it's putting in. There's your linear terms.
In the case of the iteration version, if the iterators are forward, bidirectional or random-access, the new capacity is determined before beginning,
If you know how much space you're going to need before you start a copy, you can use reserve() beforehand. assign() does this internally (essentially), if it can. In that case, there is no additional complexity, because you've got the space and all you have to do is copy things onto it.
The same is true of push_back() in this case, incidentally. The memory is already there due to the reserve(), so the only additional overhead is the function call; it's just a copy operation beyond that, same as assign().
otherwise logarithmic complexity (object reallocations) must be added on top of that.
If you don't know how much space you'll need starting out, the usual strategy----and presumably the one used inside vectors, they'd be foolish to do anything else by default----is to make a guess, usually 1 or 2 elements to begin with. When you run out of space, you double the guess and copy all existing elements to the new memory. This may seem inefficient, but it's only logarithmic amortized time, since the reallocate/copy happens only when the size of the array doubles-----thus the more work is required to do it, the less often it happens.
codeguru.com
Copyright Internet.com Inc., All Rights Reserved.