
Thread: Comparison of multiple files

  1. #1
    Join Date
    Jul 2011
    Posts
    2

    Comparison of multiple files

    A text file that we used to receive electronically will now arrive only as a printout (due to security changes, so I can't change this).

    I can scan the printout and run OCR (optical character recognition) on it, but it's not perfect. I scanned the same page several times on the same scanner, and the OCR gave me slightly different results each time.

    The differences are fairly simple for a human to spot - a "d" (D) in one version is a "cl" (C & L) in another, or a space gets added (or dropped). The same happens with lines: one version might have an extra blank line where another doesn't.

    My idea is to scan it 3 times and compare the files. If a line is the same in 2 of the 3, it is declared good. If it differs in all 3, then human intervention is required.

    What if I expand this to scanning 4 times, or 10? I'm not sure whether that would help or hurt...

    I've searched the web and know that "diff3" is good for comparing 3 files, but usually one is the ancestor of the other 2. In this case there is no "original" version, so that won't work well. I couldn't find anything else about comparing multiple files.

    I'm trying to come up with a good algorithm for comparing 3 (or more) files. It should be a 1-line to 1-line (to 1-line) comparison, with an occasional blank line thrown in.

    Is there a good way to optimize the comparisons, for each line as a whole and/or for the individual text within a line?

    (My department is using Perl, which is great at comparisons. I can even compare the lines with all white-space removed.)
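    Here's a rough sketch of the line-by-line vote I have in mind - untested Perl, and it assumes the three scans already line up line-for-line (which I know is the weak spot):

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read the three OCR outputs (assumed to line up 1:1 by line).
    my @files = map { open my $fh, '<', $_ or die "$_: $!"; [<$fh>] }
                qw(scan1.txt scan2.txt scan3.txt);

    # Compare with all whitespace removed, since the OCR shuffles blanks.
    sub key { my $s = shift; $s =~ s/\s+//g; return $s }

    my $count = 0;
    for my $f (@files) { $count = @$f if @$f > $count }

    for my $i (0 .. $count - 1) {
        my %votes;
        for my $f (@files) {
            # Treat a missing line as a blank line.
            my $line = defined $f->[$i] ? $f->[$i] : "\n";
            push @{ $votes{ key($line) } }, $line;
        }
        # A line is "good" if at least 2 of the 3 scans agree on it.
        my ($best) = sort { @{ $votes{$b} } <=> @{ $votes{$a} } } keys %votes;
        if (@{ $votes{$best} } >= 2) {
            print $votes{$best}[0];
        } else {
            print STDERR "Line ", $i + 1, ": all 3 differ, needs human review\n";
        }
    }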

  2. #2
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,016

    Re: Comparison of multiple files

    Well, solution one is a human one: complain to whoever is implementing the security change. Their paranoia is creating functionality problems. Maybe you can find a compromise that will get you the electronic version while satisfying their paranoia. Remember: the best solution is to eliminate the problem at the root!

    If that won't work, can this not be solved by having the human doing the scan/OCR check the output? It's been a while, but I seem to recall OCR programs prompting me to look at regions they weren't sure about. Worst case, can you run a spell checker?

    Your solution of scanning multiple times is fine for random errors, but I suspect it will not fix systematic errors (if the OCR program sees the same or similar bad data on a not-good-enough printout, it will probably make similar mistakes). If you want to implement it, I suggest a pre-processing pass that strips out blank lines and leading/trailing whitespace, followed by a character-by-character comparison of all three files (two nested for loops: the outer one over character positions, the inner one over the three files). Any disagreement should trigger human intervention.
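    Something like this - an untested sketch that slurps whole files, which is fine for documents of this size:

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Pre-process: drop blank lines, trim leading/trailing whitespace.
    sub load {
        open my $fh, '<', $_[0] or die "$_[0]: $!";
        my @lines;
        while (<$fh>) {
            chomp;
            s/^\s+|\s+$//g;
            push @lines, $_ if length;
        }
        return join "\n", @lines;
    }

    my @texts = map { load($_) } qw(scan1.txt scan2.txt scan3.txt);

    my $len = 0;
    for (@texts) { $len = length if length > $len }

    # Outer loop over character positions, inner loop over the files.
    for my $pos (0 .. $len - 1) {
        my %seen;
        for my $t (@texts) {
            my $ch = $pos < length($t) ? substr($t, $pos, 1) : '';
            $seen{$ch}++;
        }
        print "Disagreement at character position $pos\n" if keys %seen > 1;
    }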
    Best Regards,

    BioPhysEngr
    http://blog.biophysengr.net
    --
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

  3. #3
    Join Date
    Jul 2011
    Posts
    2

    Re: Comparison of multiple files

    Thanks for the reply. Unfortunately, as for getting to the "root" of the problem, this is military data that they are cracking down on, so we have no choice. "They" have moved the data to the secure area and "we" can't get to it electronically.

    Being military data, it is FULL of TLAs (Three-Letter Acronyms) and other abbreviations, so a spell checker won't work well, and it also makes it hard for a human to proofread.

    Thus, I am looking at using redundancy. Compare all 3 versions of the OCR output, and if only one differs, accept the 2 that match. If all 3 disagree, flag it for human checking.

    The problem is not the quality of the printout. For my testing, I used a brand-new, good-quality printout on white paper. I scanned it in 3 times and got 3 slightly different results.

    The problem is that OCR software is not 100% reliable. For instance, it can easily misread an O (letter Oh) as a 0 (number Zero), or confuse a lowercase letter L with a number 1. Even on a great printout, it has misread a lowercase "d" (D) as the 2 lowercase letters "c" and "l" (CL - like "cl"). Those errors I can easily find with a program that compares line-by-line.

    It also adds and removes blanks within lines - between words and within "words". For example, it might change from/to (I have to use dashes to show blank characters since HTML compresses blanks together):
    ABC-DEFGH-IJKL-MNOPQ-RSTUVW-XYZ
    ABC--DEFGH-IJKLMNOPQ-RST-UVWXYZ
    My quick idea for that is to remove all white-space before the first, general comparison of the lines.
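    For the character confusions, I could also fold the usual suspects together before comparing. A rough, untested normalizer - the substitution list is just my guess at the common cases:

    Code:
    # Normalize a line for comparison only: strip whitespace and fold
    # characters the OCR commonly confuses into one representative each.
    sub normalize {
        my $s = shift;
        $s =~ s/\s+//g;      # blanks move around between scans
        $s =~ s/cl/d/g;      # "d" often comes back as "cl"
        $s =~ tr/0l/O1/;     # zero vs. letter Oh, lowercase L vs. one
        return $s;
    }

    # Two lines "match" if their normalized forms are equal.
    sub lines_match { normalize($_[0]) eq normalize($_[1]) }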

    My biggest hurdle is where the OCR splits one line with several fields into several lines. I will need a "diff"-like comparison to find which lines match and which do NOT, and figure out how to combine them back.
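    For the alignment I'd probably reach for the Algorithm::Diff module from CPAN rather than roll my own. An untested sketch of spotting a line that one scan split in two:

    Code:
    use strict;
    use warnings;
    use Algorithm::Diff qw(sdiff);

    sub squash { my $s = shift; $s =~ s/\s+//g; return $s }

    # sdiff aligns two line arrays, marking each pair as unchanged ('u'),
    # changed ('c'), added ('+'), or deleted ('-').
    sub find_splits {
        my ($a, $b) = @_;    # array refs of lines from two scans
        my @hunks = sdiff($a, $b);
        for my $i (0 .. $#hunks - 1) {
            my ($op, $left, $right) = @{ $hunks[$i] };
            next if $op eq 'u';
            # Did scan B split scan A's line in two?  This line plus the
            # next one, blanks removed, would then equal A's line.
            my $next = $hunks[$i + 1][2];
            if (defined $next && squash($right . $next) eq squash($left)) {
                print "Scan B split this line: $left\n";
            }
        }
    }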

    The more I think about it, the more I realize that my program will need to know the general format of the data file, so it can progressively search the data for the fields it expects next, using regular expressions to isolate each field before doing the comparison. One OCR program I'm investigating uses "Business Rules" for that.
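    Roughly what I mean - the field names and formats here are made up purely for illustration:

    Code:
    # Hypothetical field formats; the real ones would come from the
    # document spec.  Each record is matched field-by-field, in order,
    # continuing from where the previous field matched.
    my @fields = (
        [ part_no => qr/([A-Z]{3}-\d{4})/     ],
        [ qty     => qr/(\d{1,5})/            ],
        [ date    => qr!(\d{2}/\d{2}/\d{4})!  ],
    );

    sub parse_record {
        my $text = shift;
        my %rec;
        for my $f (@fields) {
            my ($name, $re) = @$f;
            if ($text =~ /$re/g) {    # scalar //g resumes at pos()
                $rec{$name} = $1;
            } else {
                return;               # field missing: flag for a human
            }
        }
        return \%rec;
    }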

    Such is the fun. This is why I get paid... :-)

  4. #4
    Join Date
    May 2009
    Posts
    2,413

    Re: Comparison of multiple files

    Maybe the military could improve the print quality to make the OCR more accurate. It could be as simple as using a better font on higher-quality paper printed by a better printer. It would also help if all spaces were printed twice to make word separation clearer. An end-of-line sign could also be printed, followed by a CRC code.
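    With a CRC at the end of each line, verifying the OCR output becomes mechanical. A sketch, assuming a made-up format where each line ends with "|" plus the CRC32 of the line's text as 8 hex digits (Digest::CRC is a CPAN module):

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::CRC qw(crc32);   # CPAN module

    while (my $line = <>) {
        chomp $line;
        # Assumed printed format: <text>|<8 hex digits of CRC32>.
        if (my ($text, $crc) = $line =~ /^(.*)\|([0-9a-fA-F]{8})$/) {
            my $ok = sprintf("%08x", crc32($text)) eq lc $crc;
            print $ok ? "OK : $text\n" : "BAD: $text\n";
        } else {
            print "NO CRC FIELD: $line\n";
        }
    }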

    Maybe the military could accept comparing your OCR-generated electronic copy with the original file on a computer at their location. They would then issue a paper printout stating the file identification, followed by REJECTED or ACCEPTED.

  5. #5
    Join Date
    May 2009
    Posts
    2,413

    Re: Comparison of multiple files

    Quote Originally Posted by beschner View Post
    The more I think about it, the more I realize that my program will need to know the general format of the data file, so it can progressively search the data for the fields it expects next, using regular expressions to isolate each field before doing the comparison. One OCR program I'm investigating uses "Business Rules" for that.
    In addition, I think you should look at trainable OCR software. OCR becomes immensely more accurate when it's specialized to handle a certain text type.

  6. #6
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,016

    Re: Comparison of multiple files

    Quote Originally Posted by beschner View Post

    The more I think about it, the more I realize that my program will need to know the general format of the data file, so it can progressively search the data for the fields it expects next, using regular expressions to isolate each field before doing the comparison.
    I can't help but quote Jamie Zawinski:

    Some people, when confronted with a problem, think "I know,
    I'll use regular expressions." Now they have two problems.
    Seriously though: it sounds like you have a good plan (quote notwithstanding, regexes are a good solution here). Context-aware OCR will be a much more robust solution. The tradeoff, though, is rigidity: if the format changes, so too must your program.

    Sorry to hear you can't solve the root problem through clever application of a cluestick!
    Best Regards,

    BioPhysEngr
    http://blog.biophysengr.net
    --
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

  7. #7
    Join Date
    May 2011
    Posts
    22

    Re: Comparison of multiple files

    Hi,

    For the first problem, getting something like the diff of more than two files is very simple: just take the diff of file1 and file2, remove all the changed lines, then do the same with that result and file3, file4, ..., fileN.

    In the end you get a file relative to which all changes between the given files are additions. That gives you the "base" file for an n-way diff, which can be seen as a generalization of the 3-way diff with one base file.
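    In Perl the base could be built with the LCS function from the Algorithm::Diff CPAN module. A small, untested sketch:

    Code:
    use strict;
    use warnings;
    use Algorithm::Diff qw(LCS);

    sub read_lines { open my $fh, '<', $_[0] or die "$_[0]: $!"; [<$fh>] }

    # Build the "base" for an n-way diff: start with the first file's
    # lines and keep only what every later file also contains, in order.
    my @files = @ARGV;                    # file1 file2 ... fileN
    my @base  = @{ read_lines(shift @files) };
    @base = LCS(\@base, read_lines($_)) for @files;

    # Every input file is now pure additions relative to @base.
    print @base;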

    I do not know if there is software out there that does the job, because it was easier to implement the algorithm than to search for it :-)


    As a solution to the systematic errors (all scans giving the same wrong letter), you can try something a former colleague once implemented: have the OCR return not just the text, but also, for each letter position, the probability that the letter is correct plus the next three or four most likely letters, ordered by likelihood.
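    As a sketch of the idea - the candidate-list data structure here is made up; a real OCR engine would have to supply it:

    Code:
    # Each scan reports, per character position, a list of candidate
    # letters with confidences, e.g. [ ['d', 0.60], ['c', 0.25] ].
    # Summing confidences across scans and taking the best total
    # resolves many errors a single scan makes consistently.
    sub best_char {
        my @candidate_lists = @_;         # one list per scan
        my %score;
        for my $list (@candidate_lists) {
            $score{ $_->[0] } += $_->[1] for @$list;
        }
        my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
        return $best;
    }

    # best_char([['d',0.6],['c',0.3]], [['c',0.5],['d',0.4]]) returns 'd'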


    GMarco
