Floating point math question

**mop** · March 24th, 2003, 10:36 AM

The min (smallest positive normalized) and max values for IEEE 754 single precision are 1.1754E-38 and 3.4028E+38 respectively.

This leads one to believe that as long as I stay within the constraints specified, the program should - for the most part - produce the correct result.

Referencing the program below, (2.0e20 + 1) - 2.0e20 does not produce the correct result (i.e. 1.00000), however (2.0e6 + 1) - 2.0e6 does. I did some reading and realized that in order to add two floats the exponents must be the same, and there's a normalization process that the values go through such that if the difference between the exponents is greater than the number of digits of precision, the value of the smaller number will drop to 0 by the time the exponents are the same. The question then becomes: how would I avoid 'situations' in my program where the difference between the exponents is greater than the number of digits of precision?

I'm using a 32-bit fixed-point processor that has libraries for doing floating point math. For benchmarking purposes, I suspect the largest floating point values I could multiply that'll produce the correct result are 3.4028E+38 * 3.4028E+38?

Thanks for the assistance

Code:

#include <stdio.h>
#include <math.h>

int main(int argc, char* argv[])
{
	float a, b, rel_diff;

//	b = 2.0e7 + 1;      // doesn't work
//	a = b - 2.0e7;

//	b = 2.0e8 + 1.0;    // doesn't work
//	a = b - 2.0e8;

	b = 2.0e20 + 1;     // doesn't work
	a = b - 2.0e20;

//	b = fabs(1.0e20) + 1.0;
//	a = fabs(b) - fabs(1.0e20);

//	b = 2.0e6 + 1.0;    // works
//	a = b - 2.0e6;

//	rel_diff = (b - a) / (a + b);  // or MAX(a,b)

	printf("%f \n", a);

	return 0;
}
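
On the multiplication question: 3.4028E+38 * 3.4028E+38 is about 1.16E+77, far outside the single precision range, so under IEEE 754 the product overflows to infinity rather than giving a correct result. The largest value that can be squared without overflow is roughly the square root of the maximum, about 1.8E+19. A minimal C++ sketch to check this follows; the name `square_overflows` is made up for illustration, and it assumes IEEE 754 floats with default rounding.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: returns true when x * x is too large for a float.
// Assumes IEEE 754 single precision, where overflow produces infinity.
bool square_overflows(float x)
{
    assert(std::isfinite(x));   // precondition: x itself must be finite

    float product = x * x;      // stored in a float so it is rounded to single precision
    return std::isinf(product);
}
```

Called with `std::numeric_limits<float>::max()` this should report an overflow, while `1.0e19f` should not, since 1.0e38 still fits in a float.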

**dude_1967** · March 24th, 2003, 01:28 PM

mop,

Here is a simple routine written in plain C for checking if the exponents of floats are within range. I wrote it quickly so you might have to check for some simple errors if you decide to use this routine. The code sample might give some ideas on how to address this type of problem.

Sincerely, Chris.

Code:

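#include <stdio.h>
#include <float.h>   /* FLT_DIG is defined here, not in <limits.h> */
#include <stdlib.h>
#include <string.h>

#ifndef FLT_DIG
  #define FLT_DIG 6
#endif

/* Return the decimal exponent of f by parsing its "%.1e" representation. */
static int decimal_exp(float f)
{
  char buf[20];
  const char* e;

  sprintf(buf, "%.1e", f);

  /* Look for the 'e' so that a leading minus sign does not shift the offset. */
  e = strchr(buf, 'e');
  return e ? atoi(e + 1) : 0;
}

/* Return nonzero if the decimal exponents of *pu and *pv differ by
   no more than FLT_DIG, zero otherwise. */
int check_exp_range(const float* pu, const float* pv)
{
  int diff = decimal_exp(*pu) - decimal_exp(*pv);

  if(diff > 0)
  {
    return diff <=  FLT_DIG;
  }
  else
  {
    return diff >= -FLT_DIG;
  }
}

int main(int argc, char* argv[])
{
  float f1, f2;
  int i;

  f1 = (float) 1.0;
  f2 = (float) 1.0e-5;
  i  = check_exp_range(&f1, &f2);   /* exponents differ by 5 */
  printf("%d\n", i);

  f1 = (float) 1.0;
  f2 = (float) 1.0e-10;
  i  = check_exp_range(&f1, &f2);   /* exponents differ by 10 */
  printf("%d\n", i);

  return 0;
}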
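
As an alternative to parsing the printed exponent, the binary exponent can be read directly with `ilogb` from the standard math library and compared against the mantissa width. This is a different technique from the routine above, sketched here in C++. The name `within_float_precision` is hypothetical, and the check is deliberately conservative: it returns true only when the smaller operand is at least one unit in the last place of the larger one.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <limits>

// Hypothetical helper: true when the smaller of u and v is at least one
// unit in the last place of the larger, so adding it must change the result.
// Assumes both inputs are finite and nonzero.
bool within_float_precision(float u, float v)
{
    assert(std::isfinite(u) && std::isfinite(v));
    assert(u != 0.0f && v != 0.0f);

    // std::ilogb returns the unbiased binary exponent as an int.
    int diff = std::abs(std::ilogb(u) - std::ilogb(v));

    // digits is the mantissa width in bits (24 for IEEE 754 single precision).
    return diff < std::numeric_limits<float>::digits;
}
```

With the values from this thread, the pair (2.0e6f, 1.0f) should pass the check and the pair (2.0e20f, 1.0f) should fail it.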

**galathaea** · March 24th, 2003, 01:45 PM

The standard library's numeric_limits exposes epsilon (it's a static member function, std::numeric_limits<float>::epsilon()). If you multiply the larger value by epsilon and find that the result is less than the smaller value, you can add and subtract the two values to the precision of the larger value. Otherwise, the addition or subtraction will go unnoticed.
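
A minimal C++ sketch of that rule of thumb, assuming IEEE 754 floats; the function name `addition_is_noticed` is invented for this example. The test is slightly conservative, since larger * epsilon can be up to twice the actual spacing of floats near the larger value.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper following the rule of thumb above: the smaller operand
// is "seen" by the larger one if it exceeds larger * epsilon.
bool addition_is_noticed(float a, float b)
{
    assert(std::isfinite(a) && std::isfinite(b));

    float larger  = std::max(std::fabs(a), std::fabs(b));
    float smaller = std::min(std::fabs(a), std::fabs(b));

    // epsilon() is the gap between 1.0f and the next representable float.
    return larger * std::numeric_limits<float>::epsilon() < smaller;
}
```

For the original example, `addition_is_noticed(2.0e20f, 1.0f)` should be false and `addition_is_noticed(2.0e6f, 1.0f)` should be true.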

**Graham** · March 24th, 2003, 01:48 PM

mop: you should also check out what the IEEE standard says about the number of significant digits. Remember that 32 bits can only represent approximately 4x10^9 distinct values, from which it's obvious that it cannot represent every floating point number within the range specified (that would be impossible anyway, since there are at least aleph-1 real numbers). There are huge gaps in the sequence. For low magnitudes, there is pretty good coverage of the integral numbers, but as the magnitude starts to exceed the number of significant digits, the representable values get sparser and sparser. I think the significant digits figure for single precision is 7 or so. This means that you can just about distinguish r from r + 1 when r is around 10^6 (the exact limit for consecutive integers is 2^24, about 1.7x10^7), but at 10^20 there just isn't the resolution to do it - you probably couldn't distinguish between r and r + 10^6 at that sort of magnitude of number.
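
The size of those gaps can be measured with `std::nextafter` from `<cmath>` (C++11). The sketch below assumes IEEE 754 single precision, and `gap_above` is a made-up helper name.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: distance from x to the next representable float above it.
float gap_above(float x)
{
    assert(std::isfinite(x));

    float next = std::nextafter(x, std::numeric_limits<float>::infinity());
    return next - x;
}
```

Under that assumption, `gap_above(1.0e6f)` is 0.0625, `gap_above(16777216.0f)` (that is, 2^24) is 2, and `gap_above(2.0e20f)` is 2^44, about 1.76x10^13, so adding 1 (or even 10^6) to a float near 10^20 cannot change it.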

**Philip Nicoletti** · March 24th, 2003, 01:58 PM

Maybe this will give you some ideas of what is going on.

Code:

#include <iostream>
#include <iomanip>

using namespace std;

// float variables, double literals: b is rounded to float, the subtraction is done in double
void test1()
{
    float a,b;

    b = 2.0e+20;
    a = b - 2.0e+20;

    cout << setprecision(25) << b << endl;
    cout << setprecision(25) << a << endl;
    cout << endl;
}

// b assigned from a float literal, double literal subtracted
void test2()
{
    float a,b;

    b = 2.0e+20f;
    a = b - 2.0e+20;

    cout << setprecision(25) << b << endl;
    cout << setprecision(25) << a << endl;
    cout << endl;
}

// float literals throughout
void test3()
{
    float a,b;

    b = 2.0e+20f;
    a = b - 2.0e+20f;

    cout << setprecision(25) << b << endl;
    cout << setprecision(25) << a << endl;
    cout << endl;
}

// b assigned from a double literal, float literal subtracted
void test4()
{
    float a,b;

    b = 2.0e+20;
    a = b - 2.0e+20f;

    cout << setprecision(25) << b << endl;
    cout << setprecision(25) << a << endl;
    cout << endl;
}


// double variables and double literals throughout
void test5()
{
    double a,b;

    b = 2.0e+20;
    a = b - 2.0e+20;

    cout << setprecision(25) << b << endl;
    cout << setprecision(25) << a << endl;
    cout << endl;
}


int main(int argc, char* argv[])
{
    test1();
    test2();
    test3();
    test4();
    test5();

    return 0;
}

**Gorgor** · March 26th, 2003, 02:47 PM

A simple analogy in decimal is that the floating point number system only maintains the X most significant digits plus an exponent.

As long as the number being added or subtracted is within those significant digits, the arithmetic will "take". Otherwise it won't.

To see why, consider: If the floating representation can handle a "ones" digit plus 2 decimal digits (X = 3), then:

67,284 turns into 6.72e4.

Note the "84" gets lost since the internal representation can only handle the "ones" plus 2 decimal digits.

If you try to add 1 (or even 99) to it, well, 1 in e4 representation is 0.00e4, so you actually add (or subtract) 0, and thus the number doesn't change. The exponents do NOT have to be the same logically (though they may, perhaps, in the internal implementation of the chip) -- you're just trying to add or subtract something that is too small, like adding a bacterium to a whale on a scale that measures only to the nearest 100 pounds.

In your example, 1 is too small to be seen by an e20 number (whale + bacteria on industrial scale), while 1 is big enough to be seen by an e6 number (dust mite + bacteria on a sensitive scale.)
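
The decimal analogy can be played out with plain integers. This is only a toy model in C++: the helper name `keep_significant` is made up, and it truncates the way the 67,284 example does, whereas real IEEE 754 hardware rounds to nearest and works in binary.

```cpp
#include <cassert>

// Hypothetical helper for the decimal analogy: keep only the `digits` most
// significant decimal digits of n and replace the rest with zeros (truncation).
unsigned long keep_significant(unsigned long n, int digits)
{
    assert(digits > 0);

    unsigned long limit = 1;
    for (int i = 0; i < digits; ++i)
        limit *= 10;                 // 10^digits, e.g. 1000 for 3 digits

    unsigned long scale = 1;
    while (n >= limit)
    {
        n /= 10;                     // drop the least significant digit
        scale *= 10;                 // remember how many digits were dropped
    }
    return n * scale;
}
```

Here `keep_significant(67284, 3)` gives 67200, and `keep_significant(67200 + 99, 3)` is still 67200, whereas adding 100 gives 67300.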
