C++ General: How is floating point represented?

    Q: What is IEEE 754 standard?

    A: IEEE Standard 754 floating point is the most common representation today for real numbers on computers, including Intel-based PCs, Macintoshes, and most Unix platforms.


    Q: Is this the format used by Microsoft VC++ also?

    A: Microsoft Visual C++ is consistent with the IEEE numeric standards. There are three internal varieties of real numbers, of which Visual C++ uses two: real*4, declared with the keyword float, and real*8, declared with the keyword double. In Windows 32-bit programming, the long double data type maps to double. There is, however, assembly language support for computations using the real*10 data type.


    Q: What is the format specified by the standard?

    A: IEEE floating point numbers have three basic components: the sign, the exponent, and the mantissa. The sign bit is 0 for positive, 1 for negative. The exponent's base is two; the exponent field stores 127 plus the true exponent for single precision, or 1023 plus the true exponent for double precision. The mantissa is stored with an implicit leading 1, so its value is 1.f, where f is the field of fraction bits.
    To learn more, see the text of the IEEE 754 standard itself.



    Q: What is the range of real numbers in VC++?

    A:

    float (4 bytes) : 1.175494351E-38 to 3.402823466E+38, significant decimal digits: 6
    double (8 bytes) : 2.2250738585072014E-308 to 1.7976931348623158E+308, significant decimal digits: 15
    real*10 (10 bytes) : 3.37E-4932 to 1.18E+4932, significant decimal digits: 19



    Q: I have a problem with the following code.
    Code:
    #include <iostream>
    using namespace std;

    int main()
    {
      float a = 2.501f;
      a *= 1.5134f;
      if (a == 3.7850134) cout << "Expected value" << endl;
      else cout << "Unexpected value" << endl;
    }
    Why does the program output "Unexpected value", when 2.501 * 1.5134 = 3.7850134?

    A: Floating-point decimal values generally do not have an exact binary representation. This is a side effect of how the CPU represents floating point data. Different compilers and CPU architectures store temporary results at different precisions, so results will differ depending on the details of your environment. If you do a calculation and then compare the results against some expected value it is highly unlikely that you will get exactly the result you intended.

    To summarize, never write a comparison like this:
    Code:
    if (a == b) ...
    Instead, check that the result is within a given tolerance (error) of the expected value:
    Code:
    if (fabs(a - b) < error) ...
    The above example should be rewritten like this:
    Code:
    #include <iostream>
    using namespace std;

    #define EPSILON 0.0001   // Define your own tolerance
    #define FLOAT_EQ(x,v) (((v - EPSILON) < x) && (x < (v + EPSILON)))

    int main()
    {
       float a = 2.501f;
       a *= 1.5134f;
       if (FLOAT_EQ(a, 3.7850)) cout << "Expected value" << endl;
       else cout << "Unexpected value" << endl;
    }
    However, since float has 6 significant decimal digits you might want an EPSILON as small as 0.000001, depending on the tolerance you need. But you cannot use an EPSILON of 0.0000001, because in that case it exceeds the precision of float.

    To avoid any misunderstanding: a float carries only about 6 significant decimal digits, but this does not imply a fixed epsilon of 0.000001! When dealing with very small values you could just as well use an epsilon of 0.0000000000000001, provided the values being compared are equally small. Conversely, a float value such as 12345.6789 is only reliably correct in its first 6 significant digits, so it is at best accurate to about 0.1; using the macro with an accuracy of 0.0001 will not help in establishing equality.

    It is a common misconception that the epsilon used when comparing floats is (or can be) an absolute value. It is not! Epsilon (as in the FLT_EPSILON or DBL_EPSILON definitions) is the difference between 1.0 and the next representable value; to apply it to a result, you have to scale it to the same magnitude as the values you are comparing.
    Code:
    // float.h
    #define DBL_EPSILON     2.2204460492503131e-016 /* smallest such that 1.0+DBL_EPSILON != 1.0 */
    #define FLT_EPSILON     1.192092896e-07F        /* smallest such that 1.0+FLT_EPSILON != 1.0 */
    Using the real FLT_EPSILON or DBL_EPSILON definition (scaled to match the operands) to compare for equality is possible, but requires a considerable amount of code. The FLOAT_EQ() macro is an easy alternative, but be aware that it can only perform its job properly when the values tested are within the accuracy range of the type used (float/double).

    Code:
    float a = 51234.1f;
    a*= 79.6787f;
    
    if (FLOAT_EQ(a,4082266.48367)) ...
    Each of the floats used (51234.1f and 79.6787f) is sufficiently accurate on its own (each carries only about 6 significant decimal digits). The expected result, however, is written with 12 significant digits. Even though you have attempted to adjust for the inequality with the FLOAT_EQ() macro, it still returns false. In fact, the above returns false until EPSILON is raised all the way to 1.0!



    Q: Why does this inaccuracy affect floating-point types but not integer types?

    A: An integer is a string of bits representing powers of two, and these powers sum to give the decimal number. For instance, binary 1011 is 8 + 2 + 1 in decimal, which is 11.

    On the other hand, the fractional part of a floating-point number is a string of bits representing negative powers of two. For instance, binary 0.1011 is 1/2 + 1/8 + 1/16 in decimal, which is 0.6875. While some decimal values (like 0.5, 0.25, 0.75, 0.625, ...) can be represented exactly, others (like 0.1) cannot.



    Q: The following program outputs "Expected value" both in Release and Debug builds. Why?
    Code:
    #include <iostream>
    using namespace std;

    int main()
    {
       float a = 2;
       a *= 1.5;
       if (a == 3) cout << "Expected value" << endl;
       else cout << "Unexpected value" << endl;
    }
    A: That is because 1.5 has an exact binary representation: 2^0 + 2^-1, which is 1.1 in binary. Multiplying it by 2 simply increases the exponent by 1, yielding an exact value of 3.0 without any rounding.



    Q: But the next program outputs "Expected value" in the Debug build and "Unexpected value" in the Release build, and I don't know why.
    Code:
    #include <iostream>
    using namespace std;

    int main()
    {
      float a = 0.1;
      a *= 10;
      if (a == 1.0) cout << "Expected value" << endl;
      else cout << "Unexpected value" << endl;

      return 0;
    }
    A: In a Debug build, the value actually gets stored into the stack variable a before the comparison. This conversion from the FPU stack to float rounds away the extra precision, hiding the floating point error. In a Release build, the variable gets optimised away, and the extended-precision value on the FPU stack is compared to 1.0 directly (no conversion to float happens), so the comparison fails.


    Q: Where from can I learn more about floating point comparison?

    A: See the Comparing floating point numbers article by Bruce Dawson.


    Credits: This FAQ was written with the help of OReubens


    Last edited by Andreas Masur; July 23rd, 2005 at 01:04 PM.
