-
April 24th, 2009, 07:46 PM
#1
Is there an algorithm to compress based on repeating substrings?
I'm looking for algorithm that will compress a given string by creating a lookup table of repeating substrings of the string, so that the size of lookup table together with the size of the encoded string is minimal. So for instance, the input string
ababcabcabc
could be compressed as
lookup table
s=ab
t=c
encoded string
sststst
12 total characters used
or alternatively,
lookup:
m=ab
j=abc
encoded:
mjjj
11 total characters used.
So the second example results in better compression, because it chooses better repeating substrings from the original string from the lookup table.
Anyway, assuming we are handed a way to get all the repeating substrings from a string, can any one think of a way to always get the smallest (lookup table+encoded string)? The answer may be simple or complex, but I wanted to throw it out to see if anybody knows it before I spend any more time trying to figure it out.
-
April 25th, 2009, 02:03 PM
#2
Re: Is there an algorithm to compress based on repeating substrings?
What you're looking for is a dictionary-based encoding scheme. The word "dictionary" alone should provide you ample avenues if you search, but "LZW" might be a good starting place.
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|