Fast keyword search through CSortedArray??? Anyone know how?

Printable View

Show 50 post(s) from this thread on one page

September 27th, 2006, 03:38 PM
Vanaj

Re: Fast keyword search through CSortedArray??? Anyone know how?

Quote:

Originally Posted by revg75

Ya my search requirements are a bit weird, but not that weird.

If an item in the list was

iamadonkey

The user could type 'don', or 'donk', or 'key' and it would return a match. However if the user typed 'yeknod' it would NOT return a match.

That's an easy strcmp() or cstring.Find(), but you do have to look at all the words to compare...
September 27th, 2006, 03:39 PM
GCDEF

Re: Fast keyword search through CSortedArray??? Anyone know how?

I don't see a sorted array helping then. I don't know an elegent solution that doesn't involve brute force or an huge indexing structure. As I said, the best think I can think of is don't start the search with every keystroke. Just wait till the user's given enough information to narrow the first set of results considerably.
September 27th, 2006, 03:44 PM
Vanaj

Re: Fast keyword search through CSortedArray??? Anyone know how?

I think GCDEF is correct...you would have to index on each word and each word starting with next letter in that word to get a good index. OR just do the search when the word is completed by the user...
September 27th, 2006, 03:53 PM
revg75

Re: Fast keyword search through CSortedArray??? Anyone know how?

Alrighty! Perfect thanks alot guys.

I will just have them click a search button when they are done.

Just incase you are interested I was trying to mimic the Winamp jump search box. If you start winamp and press the J key a little dialog comes up where you can type, and as you type the media library narrows the search.

I haven't tried winamps search with 10's of thousands of records, but I am assuming that their will handle that.

Wonder how they did it??

Cheers,
Greg
September 27th, 2006, 04:43 PM
Philip Nicoletti

Re: Fast keyword search through CSortedArray??? Anyone know how?

I don't know the particulars of CSortedArray, but in general this
is all you have to do (you could do this if using a normal CArray
that is sorted).

1) Call std::lower_bound with the current "search" string. It will
return the first position that starts with that string. (O(logN))

2) starting at that position, do a linear loop until, adding to your
ist until:

a) you reach the end of the CSortedArray
b) the start of the current position in the CSortedArray does
not equal the search string.
September 27th, 2006, 05:16 PM
Vanaj

Re: Fast keyword search through CSortedArray??? Anyone know how?

Quote:

Originally Posted by Philip Nicoletti

I don't know the particulars of CSortedArray, but in general this
is all you have to do (you could do this if using a normal CArray
that is sorted).

1) Call std::lower_bound with the current "search" string. It will
return the first position that starts with that string. (O(logN))

2) starting at that position, do a linear loop until, adding to your
ist until:

a) you reach the end of the CSortedArray
b) the start of the current position in the CSortedArray does
not equal the search string.

This would work but the OP was to search for the "string" anywhere in the array..NOT just starting with the search string.

if the search was for "er" he wants the search to return the following

biker
error
ferrot

and so on....it is more of a cstring.Find() than a cstring.Compare() search.
September 27th, 2006, 06:58 PM
Philip Nicoletti

Re: Fast keyword search through CSortedArray??? Anyone know how?

Sorry, you are right ... I did not read all of the posts.
September 27th, 2006, 07:00 PM
Vanaj

Re: Fast keyword search through CSortedArray??? Anyone know how?

Quote:

Originally Posted by Philip Nicoletti

Sorry, you are right ... I did not read all of the posts.

Not a problem..I thought the same thing at first..then he turned on the light....
September 27th, 2006, 07:05 PM
wildfrog

Re: Fast keyword search through CSortedArray??? Anyone know how?

Quote:

Originally Posted by revg75

Incidentally, would putting the search in a seperate thread help with the search freezing up the rest of my app while typing?

What I would do is to put it in another thread, but delay the actual search for a couple of milliseconds (lets say 300ms). Then, if the user types in anything more before the search has started then delay the search some more. Then, only when the user stops typing (and expects the list to be updated... he can probably wait 300ms) then the search thread starts filtering the list.

If possible maybe you can stop an ongoing search if the user begins typing again. This way you don't trigger away 5 unneccessary seaches when the user types 'monkey' in 0.4 seconds.

The keyboard should stop hogging aswell.

- petter
September 28th, 2006, 12:37 PM
Yves M

Re: Fast keyword search through CSortedArray??? Anyone know how?

I've tried using an index with an inverted file (i.e. indicies pointing to (string, letter position) pairs), but I only get about a 10 fold performance increase over plain search. It's still optimisable though, but I don't know how much it will bring. And it comes at the cost of multiplying storage by 5.
September 28th, 2006, 01:44 PM
Vanaj

Re: Fast keyword search through CSortedArray??? Anyone know how?

Quote:

Originally Posted by Yves M

I've tried using an index with an inverted file (i.e. indicies pointing to (string, letter position) pairs), but I only get about a 10 fold performance increase over plain search. It's still optimisable though, but I don't know how much it will bring. And it comes at the cost of multiplying storage by 5.

Good info...not too many people would do something like this to help...nice work Yves..
September 28th, 2006, 03:06 PM
Yves M

Re: Fast keyword search through CSortedArray??? Anyone know how?

By the way, here is the code so far. There are a couple of things that could be optimized (mainly the set intersection algorithm), so if you feel like trying to pursue this route further...

Code:

struct inverted_file { typedef std::vector<std::vector<std::pair<size_t, unsigned char> > > index_type; index_type index; size_t char_to_index(char c) { if (c <= 'Z') { return c - 'A'; } else { return c - 'a' + 26; } } void construct_index(const std::vector<std::string> &data); void construct_subset(const std::string &s, const std::vector<std::string> &data, std::vector<std::string> &output); }; void inverted_file::construct_index(const std::vector<std::string> &data) { // The size of the index is the number of different letters that may appear const size_t index_size = 26 * 2; index.resize(index_size); index_type::iterator it; // Reserve in each component vector in the index roughly total/index_size elements for (it = index.begin(); it != index.end(); ++it) { it->reserve(data.size() / index_size); } // Now start constructing the index size_t dataoffset = 0; for (std::vector<std::string>::const_iterator datait = data.begin(); datait != data.end(); ++datait, ++dataoffset) { unsigned char i = 0; // For each character, save the position of the string it appears in and its position inside the string for (i = 0; i < datait->length(); ++i) { index[char_to_index((*datait)[i])].push_back(std::make_pair(dataoffset, i)); } } } void inverted_file::construct_subset(const std::string &s, const std::vector<std::string> &data, std::vector<std::string> &output) { // Allocate the current offset inside the index std::vector<std::vector<std::pair<size_t, unsigned char> >::const_iterator> offset, end; for (std::string::const_iterator strit = s.begin(); strit != s.end(); ++strit) { offset.push_back(index[char_to_index(*strit)].begin()); end.push_back(index[char_to_index(*strit)].end()); } // Prepare the output buffer output.clear(); // this is essentially a set intersection algorithm bool bend = false; size_t n = s.length(), max_string_index = 0; unsigned char max_char_index = 0; while (offset[0] != end[0]) { // advance the first offset to the maximum found before while (offset[0] != end[0] && (offset[0]->first < max_string_index || (offset[0]->first == max_string_index && offset[0]->second < max_char_index))) { ++offset[0]; } // Now look for a sequence where the strings are equal and string positions are consecutive size_t i = 1, current_string; unsigned char current_pos; current_string = offset[0]->first; current_pos = offset[0]->second; while ((i < n) && (offset[i] != end[i]) && (offset[i]->first < current_string || ((offset[i]->first == current_string) && (offset[i]->second <= (current_pos + i))))) { if ((offset[i]->first == current_string) && (offset[i]->second == (current_pos + i))) { ++i; } else { ++offset[i]; } } if (i == n) { // We found a match output.push_back(data[offset[0]->first]); // Use this code for multiple copies of the string if the substring is found multiple times // ++offset[0]; // Use this code for a single copy while (offset[0] != end[0] && offset[0]->first == current_string) { ++offset[0]; } } else if (offset[i] == end[i]) { // We have reached the end of one of the character lists offset[0] = end[0]; } else { max_string_index = offset[i]->first; max_char_index = offset[i]->second - i; } } }

Show 50 post(s) from this thread on one page