Hello Experts,

I have a problem that may be related to the longest common substring problem.

There is a file of about 10-20 MB; you can think of it as a text file. It contains a few duplicated parts of about 10-50 KB each. It looks like
"sometextXXXanothertextXXXlasttextpart", where the XXX parts are identical strings.

How can I find the duplicated parts as quickly and as reliably as possible?

My first attempt is to divide the whole file into smaller parts such as lines or chapters, derive a hash value for each part, and run a modified longest common substring search on the resulting, much shorter file. But this is far from reliable.
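
To make that first attempt concrete, here is a minimal Python sketch of the line-based variant. The function name and the min_lines parameter are only illustrative, and it assumes the duplicated parts start and end on line boundaries, which is exactly the assumption that makes the approach unreliable:

```python
import hashlib
from collections import defaultdict


def find_repeated_line_runs(lines, min_lines=2):
    """Return sorted (first_start, second_start, length) tuples, in line
    indices, for maximal non-overlapping runs of lines that occur twice.

    `lines` is a list of bytes objects, e.g. data.splitlines(keepends=True).
    """
    # One digest per line, so the comparison runs on a much shorter sequence.
    digests = [hashlib.sha1(line).digest() for line in lines]

    # Map each digest to every line index where it occurs.
    positions = defaultdict(list)
    for i, d in enumerate(digests):
        positions[d].append(i)

    runs = []
    for idxs in positions.values():
        for k, a in enumerate(idxs):
            for b in idxs[k + 1:]:
                # Only start at the first line of a run: if the previous
                # lines also match, this pair is inside a longer run.
                if a > 0 and digests[a - 1] == digests[b - 1]:
                    continue
                # Extend the match; stop before the two copies overlap.
                n = 0
                while (b + n < len(digests) and a + n < b
                       and digests[a + n] == digests[b + n]):
                    n += 1
                if n >= min_lines:
                    runs.append((a, b, n))
    return sorted(runs)
```

Called with the lines of the file as bytes, it reports where each repeated run starts and how many lines it spans. Lines that repeat very often (blank lines, for example) make the pair loop quadratic, and a duplicate that begins in the middle of a line is missed entirely.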

Any ideas? Thank you!


GMarco