Q: What is the IEEE 754 standard?

A: IEEE Standard 754 floating point is the most common representation today for real numbers on computers, including Intel-based PCs, Macintoshes, and most Unix platforms.

Q: Is this the format used by Microsoft VC++ also?

A: Microsoft Visual C++ is consistent with the IEEE numeric standards. There are three internal varieties of real numbers. Real*4 and real*8 are used in Visual C++. Real*4 is declared using the word float. Real*8 is declared using the word double. In Windows 32-bit programming, the long double data type maps to double. There is, however, assembly language support for computations using the real*10 data type.

Q: What is the format specified by the standard?

A: IEEE floating-point numbers have three basic components: the sign, the exponent, and the mantissa. The sign bit is 0 for positive, 1 for negative. The exponent's base is two. The exponent field is stored biased: it contains 127 plus the true exponent for single precision, or 1023 plus the true exponent for double precision. For normalized numbers the mantissa is assumed to have the form 1.f, where the leading 1 is implicit (not stored) and f is the field of fraction bits.
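As an illustration of the layout described above, the following sketch splits a single-precision value into its three fields. The names FloatFields, decode and trueExponent are made up for this example, and the code assumes that float is a 32-bit IEEE 754 type (which holds for Visual C++).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of a single-precision value.
struct FloatFields {
    std::uint32_t sign;      // 1 bit: 0 = positive, 1 = negative
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t fraction;  // 23 bits, the f in 1.f
};

// Split a float into its raw fields.
// Assumption: float is a 32-bit IEEE 754 single-precision type.
FloatFields decode(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy out the bit pattern
    return FloatFields{ bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}

// True exponent = stored exponent minus the bias (normalized values only).
int trueExponent(const FloatFields& f) {
    return static_cast<int>(f.exponent) - 127;
}
```

For example, 6.5 is 1.625 x 2^2, so decode(-6.5f) gives sign 1, a stored exponent of 129 (true exponent 2) and a fraction of 0x500000 (binary .101, that is 0.625).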

Q: What is the range of real numbers in VC++?

A:

float (4 bytes): 1.175494351E-38 to 3.402823466E+38, significant decimal digits: 6

double (8 bytes): 2.2250738585072014E-308 to 1.7976931348623158E+308, significant decimal digits: 15

real*10 (10 bytes): 3.37E-4932 to 1.18E+4932, significant decimal digits: 19
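If you want to check these limits on your own compiler, a sketch along the following lines can print them from std::numeric_limits. The function name printRange is invented for this example. Note that min() is the smallest positive normalized value, and that the figures for long double depend on the platform (on Visual C++ they match double).

```cpp
#include <iostream>
#include <limits>

// Hypothetical helper: prints the smallest positive normalized value,
// the largest finite value and the guaranteed decimal digits of type T.
template <typename T>
void printRange(const char* name) {
    typedef std::numeric_limits<T> lim;
    std::cout << name << " (" << sizeof(T) << " bytes): "
              << lim::min() << " to " << lim::max()
              << ", significant decimal digits: " << lim::digits10 << '\n';
}
```

Calling printRange<float>("float") and printRange<double>("double") from main() lists the same kind of information as the entries above.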

Q: I have a problem with the following code. Why does the program output "Unexpected value", when 2.501 * 1.5134 = 3.7850134?

Code:
#include <iostream>
using namespace std;

int main()
{
    float a = 2.501f;
    a *= 1.5134f;
    if (a == 3.7850134)
        cout << "Expected value" << endl;
    else
        cout << "Unexpected value" << endl;
}

A: Floating-point decimal values generally do not have an exact binary representation. This is a side effect of how the CPU represents floating point data. Different compilers and CPU architectures store temporary results at different precisions, so results will differ depending on the details of your environment. If you do a calculation and then compare the results against some expected value it is highly unlikely that you will get exactly the result you intended.

To summarize, never make such a comparison:

Code:
if (a == b) ...

Instead, check that the result lies within a given error of the expected value:

Code:
if (fabs(a - b) < error) ...

The above example should be rewritten like this:

Code:
#include <iostream>
using namespace std;

#define EPSILON 0.0001 // Define your own tolerance
#define FLOAT_EQ(x, v) ((((v) - EPSILON) < (x)) && ((x) < ((v) + EPSILON)))

int main()
{
    float a = 2.501f;
    a *= 1.5134f;
    if (FLOAT_EQ(a, 3.7850))
        cout << "Expected value" << endl;
    else
        cout << "Unexpected value" << endl;
}

However, since float has 6 significant decimal digits, you might want an EPSILON as small as 0.000001. It depends on the tolerance you need. But you cannot use an EPSILON of 0.0000001, because in that case it exceeds the float precision.

To avoid any misunderstanding: a float result carries only about 6 significant decimal digits. But this does not imply an absolute accuracy of 0.000001! When all the values involved are small, you could just as well use an epsilon of 0.0000000000000001, provided the values being compared are equally small. Conversely, a float value of 12345.6789 is only reliably correct in its first 6 decimal digits, so it is at best guaranteed accurate to 0.1. Applying the epsilon macro with an accuracy of 0.0001 to such a value may not actually help in establishing equality.

It is a common misconception that epsilon, when dealing with floats, is (or can be) an absolute value. It is not! Epsilon (as in the FLT_EPSILON or DBL_EPSILON definitions) is the smallest value that, added to 1.0, gives a result different from 1.0. To apply it to a result, you have to scale epsilon to the same exponent as the values you are comparing.

Code:
// float.h
#define DBL_EPSILON 2.2204460492503131e-016 /* smallest such that 1.0+DBL_EPSILON != 1.0 */
#define FLT_EPSILON 1.192092896e-07F /* smallest such that 1.0+FLT_EPSILON != 1.0 */

Using the real FLT_EPSILON or DBL_EPSILON definition (scaled to match the operands) to compare for equality is possible, but requires a considerable amount of code. The FLOAT_EQ() macro is a good, easy alternative, but you do need to be aware that it can only do its job properly when the values tested are within the accuracy range of the type used (float/double).

Code:
float a = 51234.1f;
a *= 79.6787f;
if (FLOAT_EQ(a, 4082266.48367)) ...

Each of the floats used (51234.1f and 79.6787f) is sufficiently accurate (each has only 6 decimal digits). The exact product, however, has 12 decimal digits, more than a float can hold. Even though you have attempted to adjust for the inequality with the FLOAT_EQ() macro, it still returns false. In fact, the above returns false up to the point where we set EPSILON to 1.0!
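A comparison that scales FLT_EPSILON to the operands, as described above, could be sketched as follows. The function name nearlyEqual and the default factor of 4 are assumptions made for this example, not something prescribed by the standard; choose the factor according to how much rounding error your computation can accumulate.

```cpp
#include <algorithm>
#include <cfloat>
#include <cmath>

// Hypothetical helper: true when a and b differ by no more than
// `factor` times FLT_EPSILON scaled to the larger magnitude.
// The default factor of 4 is an arbitrary choice for this sketch.
bool nearlyEqual(float a, float b, float factor = 4.0f) {
    float scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= FLT_EPSILON * scale * factor;
}
```

With this helper, the product of 51234.1f and 79.6787f compares as equal to 4082266.48367f, because the tolerance grows to roughly 2 at that magnitude. A purely relative test like this one is not suitable when comparing against zero, where the scaled tolerance also shrinks to zero.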

Q: Why do floating-point types have this inaccuracy while integer types do not?

A: An integer type number is a string of bits that represent the powers of two, and these powers sum to give the decimal number. For instance 1011 is in decimal 8 + 2 + 1, which is 11.

On the other hand, the fractional part of a floating-point number is a string of bits that represent the inverse powers of two. For instance, binary 0.1011 is decimal 1/2 + 1/8 + 1/16, which is 0.6875. While you can accurately represent some decimal values (like 0.5, 0.25, 0.75, 0.625, ...), you can't accurately represent all decimal values (like 0.1).
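The arithmetic above can be reproduced with a few lines of code. The sketch below sums the inverse powers of two selected by a string of binary digits; the name binaryFraction is invented for this example.

```cpp
#include <string>

// Hypothetical helper: value of the binary fraction 0.<bits>,
// e.g. "1011" stands for binary 0.1011.
double binaryFraction(const std::string& bits) {
    double value = 0.0;
    double weight = 0.5;  // 2^-1, the weight of the first digit after the point
    for (char bit : bits) {
        if (bit == '1')
            value += weight;
        weight /= 2.0;    // next digit is worth half as much
    }
    return value;
}
```

binaryFraction("1011") gives exactly 0.6875. Decimal 0.1, by contrast, has the repeating binary expansion 0.000110011..., so any finite number of bits stays slightly below it: binaryFraction("0001100110011") is 0.0999755859375.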

Q: The following program outputs "Expected value" both in Release and Debug builds. Why?

Code:
#include <iostream>
using namespace std;

int main()
{
    float a = 2;
    a *= 1.5;
    if (a == 3)
        cout << "Expected value" << endl;
    else
        cout << "Unexpected value" << endl;
}

A: That is because 1.5 has an exact representation in binary: 2^0 + 2^-1, which is 1.1 binary. When you multiply it by 2, which increases the exponent by 1, it yields an exact value of 3.0 without any rounding.

Q: But the next program outputs "Expected value" in the Debug build and "Unexpected value" in the Release build, and I don't know why.

Code:
#include <iostream>
using namespace std;

int main()
{
    float a = 0.1;
    a *= 10;
    if (a == 1.0)
        cout << "Expected value" << endl;
    else
        cout << "Unexpected value" << endl;
    return 0;
}

A: In a Debug build, the value actually gets stored into the stack variable a before the comparison. This conversion from the FPU stack to float "corrects" the floating-point error, because the value is rounded to float precision. In a Release build, the variable gets optimised away, and the higher-precision value on the FPU stack is compared to 1.0 directly (no conversion to float happens).
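The effect can be imitated in portable code by doing the multiplication once with the result rounded to float and once at double precision. This is only a sketch of the mechanism: it uses double as a stand-in for the wider FPU register, the function names are invented for this example, and it assumes the compiler evaluates float expressions in float precision (as on x64 targets).

```cpp
// Hypothetical helper: mimics the Debug build, where the product is
// rounded to float when it is stored back into the variable.
bool equalsOneAfterFloatStore(float a) {
    float stored = a * 10.0f;
    return stored == 1.0f;
}

// Hypothetical helper: mimics the Release build, where the product
// stays at a higher precision (double here) before the comparison.
bool equalsOneAtHigherPrecision(float a) {
    double wide = static_cast<double>(a) * 10.0;
    return wide == 1.0;
}
```

For a = 0.1f the first function returns true and the second returns false. The float nearest to 0.1 is slightly larger than 0.1, so the exact product is about 1.0000000149; that error is kept at the higher precision but disappears when the product is rounded to the nearest float, which is 1.0.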

Q: Where can I learn more about floating point comparison?

A: See the Comparing floating point numbers article by Bruce Dawson.

Credits: This FAQ was written with the help of OReubens