Is there an algorithm to compress based on repeating substrings?

**tridentblack** · April 24th, 2009, 07:46 PM

I'm looking for algorithm that will compress a given string by creating a lookup table of repeating substrings of the string, so that the size of lookup table together with the size of the encoded string is minimal. So for instance, the input string

ababcabcabc

could be compressed as

lookup table
s=ab
t=c
encoded string
sststst
12 total characters used

or alternatively,

lookup:
m=ab
j=abc
encoded:
mjjj
11 total characters used.

So the second example results in better compression, because it chooses better repeating substrings from the original string from the lookup table.

Anyway, assuming we are handed a way to get all the repeating substrings from a string, can any one think of a way to always get the smallest (lookup table+encoded string)? The answer may be simple or complex, but I wanted to throw it out to see if anybody knows it before I spend any more time trying to figure it out.

**SlickHawk** · April 25th, 2009, 02:03 PM

What you're looking for is a dictionary-based encoding scheme. The word "dictionary" alone should provide you ample avenues if you search, but "LZW" might be a good starting place.

Thread: Is there an algorithm to compress based on repeating substrings?

Thread Tools

Display

Is there an algorithm to compress based on repeating substrings?

Re: Is there an algorithm to compress based on repeating substrings?

Posting Permissions