Re: Need advice - what to use CStringArray or CStringList
For example, if you just timed this search:
Code:
MapCStringToCString::iterator itFound = myData.find( strToSearch );
if ( itFound != myMap.end() )
{
// we've found the name.
itFound->first // is the lower case version of the found name
itFound->second // this is the original case version of the name
}
Compared with your loop, you will see that this method runs rings around using arrays and looping methods. The only way the map wouldn't be faster is if your item in the array is within the first 15 or so elements in the array, and even then, the array method could be slower.
If the items your finding are in the middle of your array, you do the math. You're calling CompareNoCase() on average 50,000 times to find one item. With the map, you're calling the straight compare (actutally the CString's operator ==) a maximum of 16 times. There really isn't any doubt which would be faster.
Also for Visual C++, you need to set the preprocessor flag to enable full optimizations for STL components. That preprocessor flag is _SECURE_SCL=0 (add this to your preprocessor settings and rebuild your release version of the app). You will probably see a dramatic speed increase in the map method.
Regards,
Paul McKenzie
Re: Need advice - what to use CStringArray or CStringList
It is not clear what you are timing ... adding 128,079 elements to
any of the containers mentioned should be very fast ... not in the
40 second range. Are you sure that the container part of your
algorithm is problem ?
Also, using a binary search or a map would only help on the second
search pass of your algorithm ... and only help the search part ... it
you do other processing in the second pass you need to time that
separately.
Re: Need advice - what to use CStringArray or CStringList
Quote:
Originally Posted by
ahmd
One more interesting result, so I don't really know what to think.
128,079 files, ran only the first pass.
In case of CStringArray ---- 00:40 sec
In case of std::map<CString, CString> ---- 00:45 sec
If I understand what you have posted so far ... the first pass searches
thru your folders, and adds selected files to a container. If this is
the case, then, as I mentioned in my previous post, adding to the
container is probably not slowing you down on the first pass.
Some example timings:
Code:
A) Adding 131,256 strings of size 200 to
* a vector : 0.156 seconds
* a CStringArray : 0.172 seconds
* a map : 0.985 seconds
* a deque : 0.061 seconds
B) Adding 965,503 strings of size 200 to
* a vector : 0.703 seconds
* a CStringArray : 5.547 seconds
* a map : 9.594 seconds
* a deque : 0.25 seconds
This is about what I would expect. Adding to a vetcor/Cstring Array
is O(n) (amortized). Adding to a map is O(n * LOG(n) )
A deque is also O(n) to add ... but does not do rellocates/re-copies.
Also, as the number of elements gets large, CStringArray does a lot
more reallocations/copies than vector. (I don't know if MS has changed
the re-allocate algorithm in the its later versions of the compiler).
A map is going to be slower at adding ... its use is for fast searching ...
O(LOG(n)) compared to a linear search : O(n)
Re: Need advice - what to use CStringArray or CStringList
Guys, you misread my results. You need to look at my first post:
http://www.codeguru.com/forum/showpo...5&postcount=27
that gave the timing for the "real life" (actual) situation. As you can see I have a gain of 1 second with Paul's method.
Since that number of runs (862 files, 646 searches) was way lower than I announced, I decided to try the first pass only for a greater number of files with Paul's method. And that is the results from the second thread:
http://www.codeguru.com/forum/showpo...5&postcount=29
Again, in this case all that was done was a recursive search for all the files in a folder (including subfolders), and the second pass was not performed (due to the reasons of time it would take and due to the fact that I could not perform an algorithm of the second pass on every type of file included in the first pass.)
Re: Need advice - what to use CStringArray or CStringList
Quote:
Originally Posted by
ahmd
Guys, you misread my results. You need to look at my first post:
http://www.codeguru.com/forum/showpo...5&postcount=27
that gave the timing for the "real life" (actual) situation. As you can see I have a gain of 1 second with Paul's method.
Since that number of runs (862 files, 646 searches) was way lower than I announced, I decided to try
the first pass only for a greater number of files with Paul's method. And that is the results from the second thread:
http://www.codeguru.com/forum/showpo...5&postcount=29
Again, in this case all that was done was a recursive search for all the files in a folder (including subfolders), and the second pass was not performed (due to the reasons of time it would take and due to the fact that I could not perform an algorithm of the second pass on every type of file included in the first pass.)
I'm not sure what you mean by mis-read your results. Your description above
is exactly what I thought your results represented.
Obviously, with 862 files and 646 searches, the "container-portion" of your
algorithm will takle less than a second (not matter what type of container
you use).
It is only for larger number of elements and searches that the container to use
and the search method will be significant.
Even about 200,000 elements will not take long to add ... but could take a long time
if you do 200,000 linear searches on them.
Re: Need advice - what to use CStringArray or CStringList
An example of doing a linear search versus a binary search on sorted data:
20,000 strings in the container ...
20,000 searches (searched for each element in the container):
a) linear search : 45 seconds
b) binary search (0.2 seconds ... including the time to do the sort).
It would get even more dramatic as the number of elements/searches increases.
Re: Need advice - what to use CStringArray or CStringList
Quote:
Originally Posted by
ahmd
It seems for Visual C++, the array search is very fast, and will only be noticeable if the number of elements are very large.
However regardless of that, the fact still remains that you will be calling a very fast comparison routine (either CompareNoCase or CString::operator ==) a maximum of 100,000 times using an array of 100,000 names, as opposed to calling CString::operator == (the map::find() calls it) a maximum of 16 times if using a map.
I don't know about you, but I would take 16 times max over 100,000 times max any day.
Regards,
Paul McKenzie
Re: Need advice - what to use CStringArray or CStringList
Yes, Paul, I'll definitely use the map::find() search method. One question though, I observed during my tests that when I used maps vs. CStringArray the memory usage spiked at almost 10MB more. Does map array use more memory then a linear CStringArray and should that be my concern for a big number of files on a system with less RAM?
Also, Philip, I agree with you that the operations with the memory container itself won't take up all the time that I reported. Most of that time is used by the file I/O operations, but in this case it doesn't matter since I ran all three tests (with your methods) with the same set of files.
Again, my original issue was that CStringArray was getting progressively slower when I was adding a big number of files into it (on par of hunderds of thousands), but as my last test showed that std::map and Paul's method (excluding the search part) is actually a bit slower than CStringArray when adding new elements. (My only doubt now after Paul's previous post is whether I ran his method in a debug build by accident and that is why it showed such a stark contrast in time.)
Re: Need advice - what to use CStringArray or CStringList
Quote:
Originally Posted by
ahmd
Yes, Paul, I'll definitely use the map::find() search method. One question though, I observed during my tests that when I used maps vs. CStringArray the memory usage spiked at almost 10MB more. Does map array use more memory then a linear CStringArray and should that be my concern for a big number of files on a system with less RAM?
Also, Philip, I agree with you that the operations with the memory container itself won't take up all the time that I reported. Most of that time is used by the file I/O operations, but in this case it doesn't matter since I ran all three tests (with your methods) with the same set of files.
Again, my original issue was that CStringArray was getting progressively slower when I was adding a big number of files into it (on par of hunderds of thousands), but as my last test showed that std::map and Paul's method (excluding the search part) is actually a bit slower than CStringArray when adding new elements. (My only doubt now after Paul's previous post is whether I ran his method in a debug build by accident and that is why it showed such a stark contrast in time.)
maybe I am getting a little confused on the tests that you ran.
Correct me if I am wrong:
Test # 1) 862 files ... 646 searches ... ran both passes
All methods were about the same (as expected). The container
operations (adding 862 elements and doing 646 searches on 862
elemnets will take less than a second, even on a slow computer).
Test # 2) Adding 128,079 files to the container ... no searches.
The CStringArray was faster than the map (again as expected) ...
As shown by both your results and the benchmarks that I
provided also).
Adding to a CStringArray is O(n) amoritized. Adding to a map
is O(n*Log(n)), since it sorts the the elements when it adds.
As the number of elements increases, a map might become
faster than a CStringArray at adding elements, because
of CStringArrays non-agressive reallocation algorithm
(at least, up to VC2003)
The reason for using a map or sorted container is to speed up
the searches, not in the insertions.
And yes, CStringArray will get slower when inserting elements
(because of rellocations and recopies). vector will be faster
since it has a more aggressive reallocation scheme. deque
will be faster still as it does not do any reallocations.
(see my previous benchmarks on this).
Re: Need advice - what to use CStringArray or CStringList
Quote:
Originally Posted by
ahmd
Yes, Paul, I'll definitely use the map::find() search method. One question though, I observed during my tests that when I used maps vs. CStringArray the memory usage spiked at almost 10MB more.
A map is usually implemented as a tree, so to maintain the tree struture requires more memory for each item added to the tree. The extra elements are the branches of the tree, so that searching and traversing the tree is done in minimal time.
An array is probably the data structure that requires the least memory for storing string data. Any other container, be it a linked list, deque, map, etc. requires more memory. But for that saving in memory, you sacrifice fast(er) lookups, and you get slower insertions/deletions.
Regards,
Paul McKenzie
Re: Need advice - what to use CStringArray or CStringList
a map is usually used for "key/value" lookup. You do not have a
value,so a map might not make the most sense (a std::set is
possible).
But even using a std::set, memory will spike. I recommend using
one of the dynamic array containers(CStringArray, vector, deque).
AFTER adding ALL the elements, sort them. Then you can use
a binary search to get fast lookups without using the extra
memory.
Quick test: 200,000 vector , sorted, looking up all 200,000
elements using a binary search : 2.2 seconds (including the
one time sort).
Re: Need advice - what to use CStringArray or CStringList
Quote:
Originally Posted by
Philip Nicoletti
maybe I am getting a little confused on the tests that you ran.
Correct me if I am wrong:
Philip, come on, I said this several times already. You're right in your assumptions, except that you don't see that I'm constructing this array of file paths after finding them out with FindFirstFile/FindNextFile API's. Then during the second pass I open and read each file to get to its contents. That is what is most of that time is used up for when the array is small.
So, guys, what is the bottom line here? That the fastest way would be to use CStringArray, comprised of low-case paths, sorted alphabetically, and then use binary search for a look-up. Is that it?
If so, one question. What is more effecient to add elements according to the sort order, or add them as they are and sort the whole array when we're done?
Re: Need advice - what to use CStringArray or CStringList
1) if possible, always add all elements before sorting ... O(nLogN)
versus O(n*n). In your case, you can add all elements first.
2) For small number of elements, adding to a CStringArray is fine. As
the number of elements gets larger, adding to the CStringArray gets
more and more inefficient. I suggest running a benchmarks with
the typical values that you expect. I would use a vector or deque
instead. Again, run a few benchmarks to see the difference in
performance.
3) For large number of files with a large number of searches, you
would have to go with a conatiner that can do fast searches.
Have you tried doing 100,000 searches on 100,000 elements using
a linear search versus doing a binary search ?
Quote:
Philip, come on, I said this several times already. You're right in your assumptions, except that you don't see that I'm constructing this array of file paths after finding them out with FindFirstFile/FindNextFile API's. Then during the second pass I open and read each file to get to its contents. That is what is most of that time is used up for when the array is small.
I am not sure why you think I am not seeing that. Every benchmark
timing that I have provided assumed that.
My conclusiion : For a small number of files, the container
used and the serach method will not take much time. It is
only for a large number of files and searches that the container
and search method matters.
Re: [RESOLVED] Need advice - what to use CStringArray or CStringList
OK. Last post. Thank you everyone, who contributed, especially Paul and Philip, I appreciate your insight.