Curve Comparison Algorithm?

**wdolson** · November 9th, 2011, 05:46 AM

I'm making updates and maintaining some very old code. The original program suite was written for Windows 1.0 and except for a port to Win 32 and what I've been doing the last year, there have been few updates. This system does many things but one area where it is weak is in a curve comparison function.

Basically it compares two XY curves using a tolerance and spits out a pass/fail. The user can enter the tolerance, or they can use an automatic tolerance. The automatic tolerance has problems, and the comparison can spit out false passes when the curve are obviously very different (by visual inspection) and sometimes does a false fail on curves that visually look the same on the graph.

Eventually I'm going to be able to do a major rewrite of the whole system, but before I can start on that, we need to get to a stable spot with what they are selling now.

I need a better algorithm than we're using now. A search of the net turned up a lot of comparison logic for doing sorts, and some stuff on comparing two points, but I need to be able to test two curves against one another. The data presented to the comparison is basically two arrays of floats with X and Y data. The two arrays can be differing size as the user can change the number of data points in a test.

This system is collecting data via I/O. The way it's done now you always have a Control sample that is known working and that is compared against a unit under test.

In the future, my customer would like to be able to test many samples and do statistical analysis on the data then come up with a statistical Control sample. In this scenario, you'd have say 100 units, you know most of them are good, but some may be counterfeit. You don't know if a specific unit is real or counterfeit though. By running a statistical analysis on all the data, we can hopefully develop a middle of the bell curve Control sample that future units will be compared against.

That's coming sometime next year. Right now the scenario is the simpler one with just two curves. One is the Control from a known good unit. The other curve is comparing against it. If both curves always had the same number of data points and the same start and end points, that would be very simple. What confounds me in making an algorithm is that curves that would still pass could start at different points, and have a different number of data points.

For example, one curve could start at -10 and go to +10 with 200 samples. The other curve could go from -7 to +7 with 100 samples. The comparison is essentially trying to do what the eyeball can do intuitively looking at the curves on the screen. If they cover each other close enough for the tolerance, it passes. If they don't have the same shape (within tolerance), it fails.

Are there any curve comparison algorithms out there? Can anyone recommend any books, articles, websites, or even sample code?

This is where my EE background instead of CompSci background is failing me. I'm an embedded programmer who has moved into a mix of embedded and the Windows app world. The math I took in school (20+ years ago) was more practically focused than the more theoretical stuff the CompSci and Math majors got (at my school back then the CompSci department was under the same umbrella as the Math department).

How it currently works - In case anybody cares…

The auto tolerance is set by getting the sample size (range/numSamples) and multiplying by a factor. In some cases, this results in a tolerance wider than the entire range.

Then it narrows the compare by looking at the endpoint X values (Xmin and Xmax). It takes the lesser of the two. For the above example with a range of -10 to +10 and -7 to +7, the range for the compare would be -7 to +7.

Then it takes the starting point in one curve and compares the X and Y to every single point in the other curve. If any point in the other curve falls within the box created by the tolerance, that point passes and it goes on to the next point.

The point by point comparison method tends to work OK when the tolerance is something reasonable, but the way auto tolerance is done, that can allow false passes through. I haven't figured out why we get false fails though. My customer is trying to get some real world data from one of his customers in which false fails are happening.

Intuitively, I don't like the compare one point to everything method. It just doesn't seem right, though it might be among the better choices if the tolerance could be set more intelligently.

Thanks,

Bill

**superbonzo** · November 9th, 2011, 06:18 AM

first of all, this should have been posted in the algorithm forum, being not related to c++ programming ...

anyway, do you want a statistical measure of the similarity of two shapes or just an heuristic psychophysically based estimate of their similarity ?
in the former case, you should define a statistical model of the data ( coming both from the tested and the reference apparatus ) and apply some goodness of fit estimate ( google "goodness of fit" for many,many possible examples ... )

and, yes, your current algorithm it's definitely wrong ...

**nuzzle** · November 9th, 2011, 01:25 PM

Originally Posted by wdolson

If both curves always had the same number of data points and the same start and end points, that would be very simple. What confounds me in making an algorithm is that curves that would still pass could start at different points, and have a different number of data points.

You can fix this by using linear interpolation between data points. Then, in principle, curves are defined "everywhere" and not just at discrete points. This allows you to compare an arbitrary number of points within an interval two curves share.

One measure of similarity is what you do now, namely to reqiure that the distance between the curves are smaller than a certain tolerance at all points. But you can relax this by allowing a certain amount of outliers. This may reduce the problems with false fails caused by a small number of "freak" data points.

**wdolson** · November 10th, 2011, 04:29 AM

Originally Posted by superbonzo

first of all, this should have been posted in the algorithm forum, being not related to c++ programming ...

anyway, do you want a statistical measure of the similarity of two shapes or just an heuristic psychophysically based estimate of their similarity ?
in the former case, you should define a statistical model of the data ( coming both from the tested and the reference apparatus ) and apply some goodness of fit estimate ( google "goodness of fit" for many,many possible examples ... )

and, yes, your current algorithm it's definitely wrong ...

Sorry about the wrong forum, I didn't scroll down far enough when looking. I see the algorithm forum down lower.

I think what we're looking for is the latter type of comparison. Basically, do these two curves lie on top of one another (within the bounds of tolerance)?

Thanks for the sanity check. This algorithm has been driving me crazy for a couple of months. I thought I had it working well enough, but then a customer ran into problems this week. I think I've been able to determine that the customer was using tolerances that were way too wide, which accounted for about 80% of their problem, but the remaining 20% is our problem.

My customer claims that this worked on the old system before it was updated to Win32. I've looked at the old Win 16 code and it worked the same way as it does now, so I don't know how it ever worked right.

Thanks for the terms to google for. I had some concept what I was looking for, but didn't have the terminology to do a good search.

Bill

**superbonzo** · November 10th, 2011, 06:39 AM

Originally Posted by wdolson

I think what we're looking for is the latter type of comparison. Basically, do these two curves lie on top of one another (within the bounds of tolerance)?

I see, well, I've never read anything specific on the subject ( that is, on the vision process of matching superimposed 2d shapes ) so I cannot tell for sure ... anyway, thinking about it, I see two ways people behaves when asked on "whether two shapes match" ( say, drawn on two stacked tracing papers ):

1) one moves the two sheets to get a best overall matching of the shapes, estimating differences in terms of "hole" areas, parallelism or other visual features; the eyeballs move erratically from point to point of the examination plane, collecting informations on the overall geometry. This seems a very complex task to simulate being influenced by so many factors ( just consider phenomena like the appereance of invisible edges, or the influence with other cognitive processes, like recognizing a typographical character or symbol ... ).

2) after a preparatory overall matching, one chooses a starting point of both shapes and moves the two sheets to match them; then, the resulting rails-like path is followed estimating the "distance" between them; this is similar to your current approach as the eyeballs move along the rails;
so, you should model the way the eyeballs continously pick pairs of points and evaluate the distance between them, taking into account any local geometrical feature of the paths, including any eventual noisiness. Again, this process may be influenced by contextual properties of the experiment, like instructing/inducing the subject on doing something or not;

Anyway, assuming that the shapes coordinates data don't need any preliminary trasformation ( translation, rotation, scaling, etc ... ), a trivial approach would be:

1) the initial value of the points pair (p10,p20) is the pair of starting points of the two curves, whose distance must be within tolerance.

2) given a pair (p1i,p2j), if |p1i+1 - p1i| >= | p2j+1 - p2j | find J >= j such as | p1i+1 - p2J | is a local minimum, let's call it M; if M is above the tolerance stop, otherwise set (p1i+1,p2J) as the new pair; if |p1i+1 - p1i| < | p2j+1 - p2j | do the same as above, with p1 and p2 swapped. Repeat till the end.

in order to estimate an automatic tolerance, one could perform the algorithm above dropping the tolerance comparisons and caching the local minima M's.
then you could study the distribution of such minima to get a picture of the "pathness" of the curves and their deviations; for example, two smooth paths differing by gaussian noise of amplitude small with respect to the mean curvature of the paths would give a distribution of the squares of M's resembling the chi-square distribution, whereas unrelated paths would produce distribution with much longer tails ...

Originally Posted by wdolson

My customer claims that this worked on the old system before it was updated to Win32. I've looked at the old Win 16 code and it worked the same way as it does now, so I don't know how it ever worked right.

are you sure it's not just a bug in the numerical logic, related to integer data type sizes ? are the results reasonably the same ?

**JohnW@Wessex** · November 10th, 2011, 08:17 AM

If you're interested in the similarity of shape then you could try using correlation, or its close cousin, covariance.

**superbonzo** · November 10th, 2011, 08:38 AM

Originally Posted by JohnW@Wessex

If you're interested in the similarity of shape then you could try using correlation, or its close cousin, covariance.

depending on what you compute the correlation of, this would not necesseraly give a result comparable to a "visual" match; of course, such calculations would probably appear in some of form if the OP were interested in a statistical comparison of shapes, but then a statistical model of the data would be needed in order to obtain meaningful results anyway ...

**JohnW@Wessex** · November 10th, 2011, 12:16 PM

1) one moves the two sheets to get a best overall matching of the shapes,

Actually, come to think of it, that reminds me a bit of 'cross correlation'. You get a maxima where the curves align.
You could then apply a different algorithm to check the delta between samples on the curves, maybe after normalising the scale?

**wdolson** · November 10th, 2011, 09:18 PM

Originally Posted by superbonzo

Anyway, assuming that the shapes coordinates data don't need any preliminary trasformation ( translation, rotation, scaling, etc ... ), a trivial approach would be:

1) the initial value of the points pair (p10,p20) is the pair of starting points of the two curves, whose distance must be within tolerance.

2) given a pair (p1i,p2j), if |p1i+1 - p1i| >= | p2j+1 - p2j | find J >= j such as | p1i+1 - p2J | is a local minimum, let's call it M; if M is above the tolerance stop, otherwise set (p1i+1,p2J) as the new pair; if |p1i+1 - p1i| < | p2j+1 - p2j | do the same as above, with p1 and p2 swapped. Repeat till the end.

in order to estimate an automatic tolerance, one could perform the algorithm above dropping the tolerance comparisons and caching the local minima M's.
then you could study the distribution of such minima to get a picture of the "pathness" of the curves and their deviations; for example, two smooth paths differing by gaussian noise of amplitude small with respect to the mean curvature of the paths would give a distribution of the squares of M's resembling the chi-square distribution, whereas unrelated paths would produce distribution with much longer tails ...

The problem is a bit more difficult because the beginning and end point of the two curves may not be the same X, though this may be OK. Additionally one curve can have more points than the other.

are you sure it's not just a bug in the numerical logic, related to integer data type sizes ? are the results reasonably the same ?

All the data points are floats, which I believe are the same size in Win 16 and 32, though there might be some rounding differences between the two systems.

Bill

**nuzzle** · November 10th, 2011, 11:17 PM

Originally Posted by wdolson

The problem is a bit more difficult because the beginning and end point of the two curves may not be the same X, though this may be OK. Additionally one curve can have more points than the other.

See my previous reply for how to handle this.

The most likely cause of false fails is also mentioned there.

**superbonzo** · November 11th, 2011, 03:23 AM

Originally Posted by wdolson

The problem is a bit more difficult because the beginning and end point of the two curves may not be the same X

yes, I forgot that part, but it's just a matter of ignoring the first and last tolerance comparison to accomodate for that.

Originally Posted by wdolson

Additionally one curve can have more points than the other

this is already taken into account by the algorithm ( unless I'm missing something, of course

)

Originally Posted by wdolson

All the data points are floats, which I believe are the same size in Win 16 and 32, though there might be some rounding differences between the two systems.

you're right, but there could be intermidiate integer calculations you may want to check anyway ... ( code like 1.f*(2*3) ... )

Originally Posted by nuzzle

See my previous reply for how to handle this.

interpolation would be of little help here; as far as non matching start/end points is concerned it would not necesserely produce acceptable points ( the higher the order of interpolation the more sensible extrapolated points will be to data noise, giving even more false negatives ); regarding interpolated data points in the common range of definition, it may be of little use ( for example, linear interpolation would not sensibly affect distance minima calculations ) or introduce possibly unwanted effects ( for example, in case of noisy data a smoothing pass would be required ).

Originally Posted by nuzzle

The most likely cause of false fails is also mentioned there.

do you mean an outlier cutoff ? if you don't preemptively estimate the expected number of outliers such a threshold would reduce the number of false positives simply at the cost of growing the false negatives, in general.
Moreover, the original algorithm has more defects, for example, what does happen when the curve intersects or passes next to itself ?

**nuzzle** · November 12th, 2011, 12:53 AM

Originally Posted by superbonzo

interpolation would be of little help here;

A little? It will solve the problem as stated by the OP.

Right now he's comparing apples with oranges. To avoid that resampling is necessary and the simplest form is linear interpolation.

do you mean an outlier cutoff ?

That's what I said yes. Garbage in garbage out.

The first step is to remove obviously erratic input data before any further processing takes place. It can be fully automatic but doesn't have to. The best approach often is to warn the user that something may be fishy with the data and ask him to take corrective action. As an alternative or in addition simple median filtering can be used to reduce "salt & pepper" noise.

The second step is to allow variation. The OP uses an interval of tolerance but if it's rigid users will tend to increase it until also very different curves pass as similar. You get false negatives. But even with a high tolerance setting two otherwise very similar curves may differ at only one point, and get rejected as different. You get false positives. To avoid these problems the simplest way is to make the tolerance "fuzzy" by allowing a certain amount of outliers.

---

The simplest way to make curve matching robust and resilient is to implement the steps above. In fact you have too. Washing dirty input data, comparing apples with apples and allowing a certain amount of freak variation are cornerstones. They can be made more sophisticated but that's another story.

I may add that I haven't approached this as a pattern matching or shape recognition or data mining or feature extraction problem where you essentially look for general correlations. If I get the OP right this is overkill. I've looked at the more specific and much simpler problem of deciding whether two data sets (which are recordings of the same physical process and which are supposed to be identical) really can be considered the same. A test for overall equality if you will.

**wdolson** · November 15th, 2011, 09:15 PM

It's been a few days. My customer is pushing me to do a quick and dirty fix to get something out there, then come back later with fancier solutions. One data set is always called the control and the other is sometimes called the device under test (DUT) and other places is just called the Test file. My customer found while testing that he could get better results if he tested the files one way, then turned them around (assigning the control file to test and the test file to control).

I did a quick mod to the test to do two passes and I did get results closer to what his customer expected. That will probably be what we'll have to live with for a while.

I also did an experiment based on something I found in the data file. There is a third data set in there. The measurements are voltages and currents output and the third data set is the input voltages. The input voltages represent a data set that is more predictable than the other two data sets.

I stepped through the input voltage in one file and found the corresponding point in the other (doing straight line interpolation if needed) and then compared the output values using the tolerance. It caught all the bad curves and some that had not been caught be the first method. It's also somewhat faster.

The new curves caught are ones in which the curves look like they are close to on top of one another, but the output voltages and currents for certain points don't match the same input voltage point in the other file, though another point nearby with a different Vin has an output within tolerance.

I don't know if my customer will want to use this, but I've offered it as an option with the two pass method being the default. I guess it's good enough for now.

Thanks for the input, it did help me and I might do something more formal when I get a chance to redo the comparison function entirely in the coming months.

Bill

**nuzzle** · November 16th, 2011, 01:54 AM

Originally Posted by wdolson

The new curves caught are ones in which the curves look like they are close to on top of one another, but the output voltages and currents for certain points don't match the same input voltage point in the other file, though another point nearby with a different Vin has an output within tolerance.

I think you should also try the median filtering I mentioned. Since the curves are recordings of physical processes there will be added noise. Median filtering is a very simple way of reducing so called "salt & pepper" noise, that is sudden short deviations from a trend.

http://en.wikipedia.org/wiki/Median_filter

It should be performed on the raw data before any further processing (like resampling ala linear interpolation) takes place.

**Learner9** · June 2nd, 2012, 01:59 AM

I am unskilled in curve comparisons but I am try to find some formula to compare a dataset with a reference dataset.

I have an application where there are two sets of data. The first one is the reference data ie a sine wave and the second is a set of data trying to follow the shape of the sine wave. If the second data set provided a shape that was very similar to that of the sine wave it would have a relationship of “1” and if the second data set was a straight line it would have a relationship of “0”. Amplitude does not really matter. If the two sets of data are out of phase it does not matter also although both datasets are collected at the same time. It is the amount of similarity of the shape that we are interested in. Some engineers have mentioned that coherence is the type of comparison we must do.

Can you or anyone help me or point me in the right direction?

Thread: Curve Comparison Algorithm?

Thread Tools

Display

Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Re: Curve Comparison Algorithm?

Posting Permissions