|
-
January 5th, 2012, 02:11 AM
#1
HashSet of strings - subsets
Hey all,
I've got a massive collection of stand-alone strings. I'm currently storing them in a hashset (maybe this is not the proper container for what I'm trying to do). I'm currently using the hashset for an efficient lookup to see if a particular string exists.
However, for optimization purposes, I need to test if a given string is the starting subset of any particular word in the hashset. (For example, if I feed it the word "ca" and my hashset contains the word "cat", I would want it to return true.)
If the string I feed it does NOT start any particular word in the dictionary, it should return false. (For example, if my hashset only contained one word, "cat", and I test the word "at", it should return false.)
Hopefully that makes sense. I've looked through the hashset documentation and wasn't sure if anything would fill this niche. If hashsets are not the proper container to use here, I'm definitely open to suggestions.
Thanks!
-
January 5th, 2012, 05:31 AM
#2
Re: HashSet of strings - subsets
The simplest way of implementing this would be to use a List<string> which is in sorted order and use a BinarySearch to figure out if it contains the item you require.
www.monotorrent.com For all your .NET bittorrent needs
NOTE: My code snippets are just snippets. They demonstrate an idea which can be adapted by you to solve your problem. They are not 100% complete and fully functional solutions equipped with error handling.
-
January 5th, 2012, 11:18 AM
#3
Re: HashSet of strings - subsets
Ah, excellent. I wasn't aware Lists contained a logarithmic lookup like that. Thanks!
-
January 6th, 2012, 01:32 AM
#4
Re: HashSet of strings - subsets
If you want to use a hashtable, use Dictionary<,>... see http://msdn.microsoft.com/en-us/library/xfhwa508.aspx
You could add all prefixes (so for cat add "c", "ca" and "cat") and then check for contains with yourDictionary.Contains("ca"); This would be O(1), but is space inefficient (it requires all words AND all prefixes).
You could use the list approach as Mutant Fruit suggested and get O(log n) where n is the number of strings.
I would suggest a tree data-structure where the children of any given node are the "next letter" in the word. Consider a tree contains the words { cat, can, cold, hat } then obtain:
Code:
(Root)
/ \
First letter c h
/ \ \
Second letter a o a
/ \ \ \
Third letter t n l t
\
Fourth letter d
Then to check to see if a prefix is present, descend from root. If all nodes exist, return true;
Example: Input "cat" and take the path: Root -> c -> a -> t; return true;
Example: Input "ca" and take the path: Root -> c -> a; return tur;e
Example: Input "hard"; and take the path Root -> h -> a. r does not exist! return false;
Example: Input "at"; take the path Root. a does not exist! return false;
This gives you a result n constant time (O(1)) in the number of words in your gigantic wordlist and a worst-case time complexity of O(k) where there are k letters in your query (which I think is as bad as hashtable... presumaby hashtable has dependence on the number of bits in the hashvalue).
Best Regards,
BioPhysEngr
http://blog.biophysengr.net
--
All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.
-
January 6th, 2012, 09:30 AM
#5
Re: HashSet of strings - subsets
another choice is sql server. store your strings in a table and use sql LIKE to search the matches.
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|