A text file that we used to receive electronically will now arrive as a printout (this is due to security changes, so I can't change it).

I can scan the file in and use OCR (optical character recognition), but it's not perfect. I scanned the same page in several times on the same scanner and the OCR gives me slightly different results.

The differences are fairly simple for a human to spot - a "d" (D) in one version is a "cl" (C & L) in another. Or a space may be added (or skipped). The same applies to whole lines: one version might have an extra blank line where another doesn't.

My idea is to scan it 3 times and compare the files. If a line is the same on 2 of the 3, then it is declared good. If different on all 3, then human intervention is required.
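
As a rough sketch of the 2-of-3 rule (written in Python purely for illustration, the same logic should port to Perl with a hash of counts; the function name `vote` and the sample strings are made up):

```python
from collections import Counter

def vote(lines):
    """Return the text that at least 2 scans agree on, or None if no two agree."""
    text, count = Counter(lines).most_common(1)[0]
    return text if count >= 2 else None  # None means a human has to look at it

# The same line as read by three scans; the second scan misread "d" as "cl".
scans = ["total due: 100", "total clue: 100", "total due: 100"]
print(vote(scans))
```

This assumes the three scans have already been lined up so that each list holds the same logical line from every scan.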

What if I expand this to scanning 4 times, or 10 times? I'm not sure whether that would help or hurt.

I've searched the web and know that "diff3" is good for comparing 3 files, but usually one is the ancestor of the other 2. In this case, there is no "original" version, so that won't work very well. I couldn't find anything else about comparing multiple files.

I'm trying to come up with a good algorithm for comparing 3 (or more) files. It should be a 1-line to 1-line (to 1-line) comparison, with an occasional blank line thrown in.
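
One way I could imagine handling the stray blank lines is to drop them before pairing lines up. This is only a sketch in Python (for illustration, not what we run today), and it assumes blank lines carry no meaning in the file, which may not be true for every report; `content_lines` and `align` are names I made up:

```python
def content_lines(text):
    """Split scanned text into lines, dropping blank or whitespace-only lines."""
    return [line for line in text.splitlines() if line.strip()]

def align(scans):
    """Pair the Nth non-blank line of every scan; None if the counts differ."""
    cleaned = [content_lines(s) for s in scans]
    if len({len(c) for c in cleaned}) != 1:
        return None  # a line was lost or split, so positions can't be trusted
    return list(zip(*cleaned))

# Three scans of the same two-line page, with blank lines in different places.
rows = align(["x\n\ny\n", "x\ny\n", "x\ny\n\n"])
```

If the non-blank line counts differ between scans, this gives up rather than guessing, so that page would still need a human.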

Is there a good way to optimize the comparisons for each line, and/or the individual text within a line?

(My department is using Perl, which is great at comparisons. I can even compare the lines with all white-space removed.)
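
To show what I mean by comparing with white-space removed, here is a minimal sketch (in Python for illustration; in Perl the key would be something like a `s/\s+//g` substitution). The helper names `key` and `vote_ignoring_space` are hypothetical:

```python
import re
from collections import Counter

def key(line):
    """Comparison key: the line with all white-space removed."""
    return re.sub(r"\s+", "", line)

def vote_ignoring_space(lines):
    """Return a line whose key matches at least 2 scans, or None."""
    winner, count = Counter(key(l) for l in lines).most_common(1)[0]
    if count < 2:
        return None
    # Return the first original line with the winning key, spacing intact.
    return next(l for l in lines if key(l) == winner)

picked = vote_ignoring_space(["ab cd", "abcd", "ab  cd"])
```

Note that this only decides which characters are right. It does not decide which spacing is right, since it simply returns the first matching scan's version.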