HashSet of strings - subsets

**marshdabeachy** · January 5th, 2012, 02:11 AM

Hey all,

I've got a massive collection of stand-alone strings. I'm currently storing them in a hashset (maybe this is not the proper container for what I'm trying to do). I'm currently using the hashset for an efficient lookup to see if a particular string exists.

However, for optimization purposes, I need to test if a given string is the starting subset of any particular word in the hashset. (For example, if I feed it the word "ca" and my hashset contains the word "cat", I would want it to return true.)

If the string I feed it does NOT start any particular word in the dictionary, it should return false. (For example, if my hashset only contained one word, "cat", and I test the word "at", it should return false.)

Hopefully that makes sense. I've looked through the hashset documentation and wasn't sure if anything would fill this niche. If hashsets are not the proper container to use here, I'm definitely open to suggestions.

Thanks!

**Mutant_Fruit** · January 5th, 2012, 05:31 AM

The simplest way of implementing this would be to use a List<string> which is in sorted order and use a BinarySearch to figure out if it contains the item you require.

**marshdabeachy** · January 5th, 2012, 11:18 AM

Ah, excellent. I wasn't aware Lists contained a logarithmic lookup like that. Thanks!

**BioPhysEngr** · January 6th, 2012, 01:32 AM

If you want to use a hashtable, use Dictionary<,>... see http://msdn.microsoft.com/en-us/library/xfhwa508.aspx

You could add all prefixes (so for cat add "c", "ca" and "cat") and then check for contains with yourDictionary.Contains("ca"); This would be O(1), but is space inefficient (it requires all words AND all prefixes).

You could use the list approach as Mutant Fruit suggested and get O(log n) where n is the number of strings.

I would suggest a tree data-structure where the children of any given node are the "next letter" in the word. Consider a tree contains the words { cat, can, cold, hat } then obtain:

Code:

                       (Root)
                       /    \
First letter          c      h
                     / \       \
Second letter       a   o       a
                   / \   \       \
Third letter      t   n   l       t
                           \
Fourth letter               d

Then to check to see if a prefix is present, descend from root. If all nodes exist, return true;

Example: Input "cat" and take the path: Root -> c -> a -> t; return true;

Example: Input "ca" and take the path: Root -> c -> a; return tur;e

Example: Input "hard"; and take the path Root -> h -> a. r does not exist! return false;

Example: Input "at"; take the path Root. a does not exist! return false;

This gives you a result n constant time (O(1)) in the number of words in your gigantic wordlist and a worst-case time complexity of O(k) where there are k letters in your query (which I think is as bad as hashtable... presumaby hashtable has dependence on the number of bits in the hashvalue).

**Silent Sojourner** · January 6th, 2012, 09:30 AM

another choice is sql server. store your strings in a table and use sql LIKE to search the matches.

Thread: HashSet of strings - subsets

Thread Tools

Display

HashSet of strings - subsets

Re: HashSet of strings - subsets

Re: HashSet of strings - subsets

Re: HashSet of strings - subsets

Re: HashSet of strings - subsets

Posting Permissions