Know Your FPU, 2006 Edition

Know your FPU: Fixing Floating Fast

Sree Kotay, 2006
Come see me :)

This is roughly six years overdue, but I wanted to share a more robust version of my favorite Float to Int technique and a broader version of the code that Mike Herf posted on his most excellent site. That article is a good primer, and I tried to avoid repeating too much of the same material here.

Truthfully, there are a lot more floating point tricks than just conversions to integers, but I think most people would be surprised just how much of a bottleneck this one simple conversion is, even on modern processors - especially for multimedia coding (graphics, video, music, sound, speech, etc.).

It was easily a 2x difference in my (software) 3d renderer when I'm geometry bound. And in my 2d vector rendering code, its often the difference between between being geometry bound and not . Just for context: when I mention "production code" in this article, I don't mean my private HaXor elite graphics code that will p0wn your lame commerical product: I mean the variety of products in the market place that continue to use my realtime graphics and/or UI code on a variety of platforms, so I think I can speak with some small authority on the topic. Further, I think its worth some (more) exploration exactly because its a transparent, common, but expensive operation for realtime multimedia coders.

There's a LOT thats been written on this subject. Probably the most complete (number of techniques, not complete descriptions) is at the Fluid Studios website. (aside from Herf's of course :)).

That said, the thing I've noticed about most people's conversions (or at least their explanations of them) is that they tend to be functionally and numerically, um, incomplete, for lack of a better term. By that I mean, they don't round out the conversion routines you'd like (pun intended :)), and their numerical validity is generally a little unclear.

For example, if you use the "fistp in rounding mode" conversion technique that almost everyone recommends, and you're writing graphics code, you probably have a subtle rounding bug you didn't know you had. I know I did. In addition, understanding this particular floating point trick (and its variants) is a good way to start helping you construct your own.

In this article, I'll compare some fast floating point to integer conversions, extend one of the techniques to ultrafast conversion directly to fixed point, and provide substantially faster floor() and ceil() functions, all with an eye to making the arithmetic implications and tradeoffs clear.

Can you snatch the mantissa from my hand, grasshopper?

The smartest one of all!!!
To cut to the chase, here's the performance results for basic conversion I see on my machine (Pentium IV 1.2ghz):

Performing 10000000 times:
simple cast 2819 ms i.e. i = (long)f; xs_ToInt 1242 ms i.e. i = xs_ToInt(f); //numerically same as above bit-twiddle(full) 1093 ms i.e. i = BitConvertToInt(f); //rounding from Fluid fistp 676 ms i.e. i = FISTToInt(f); //Herf, et al x86 Assembly rounding bit-twiddle(limited) 623 ms i.e. i = FloatTo23Bits(f); //Herf, rounding only in the range (0...1] xs_CRoundToInt 609 ms i.e. i = xs_CRoundToInt(f); //rounding with "magic" numbers

I can verify that these results are basically the same as what I see in production graphics code, across processor speeds/types (of the x86 variety), but this particular set of timings came from an included benchmark utility. If anything, I find that the "magic number" technique that xs_CRoundToInt() uses tends to pipeline much better that fistp, resulting in a bigger gain in the real world than the benchmark demonstrates.
Explanations to follow, but the important part is that the technique I like, xs_CRountToInt(), is fastest, more flexible (more on this coming), and more portable: across CPU types, and even across code and compilers - for example, it doesn't require any assembly or processor-level floating point flags to be set for correctness.
[Aside: xs_ToInt() uses the same tricks ast xs_CRountToInt(), but matches the direct C cast to integer for correctness. It has a compare that's required to match the ANSI/C Standard - and branching turns out to be unsurprisingly pretty bad for performance.]
So why is this so much faster than the simple cast?
First, you should understand the folks who write compilers are NOT dumb. In fact, they are, more likely than not, smarter than you (and I put myself in the "you" bucket, just to be clear). So why is their version - the code generated by the compiler to perform this operation - not just slow, but real slow?
The answer's pretty pedestrian.
In order to make floating point (single, double, extended, et al) as accurate and as fast as possible, there are a number of truncation and precision choices that the IEEE proscribed for performing arithmetic - after all, even extended precision floating point numbers are a finite, fixed number of bits. In particular, arithmetic truncation because of limited precision will round towards or away from 0 for any particular math operation (multiply, divide, and normalization in particular) in a deterministic but "averaging" way. Simply put (over-simply, really, but good enough for our purposes), some numbers round up, some down - and numbers at the "boundary" (i.e. exactly halfway between) don't always round up as they did when you were learning about rounding in your high school math class. Actually, the rounding in these cases is always towards the "even" number.
Now this is a fine rule. It even has a name: Banker's Rounding, so clearly it must be fine - bankers use it! Incidentally, that's why I distinguish the "CRound()" from the just "Round()" versions of my conversion routines - the former uses Banker's rounding (like C) where the latter is arithmetic rounding.
The net result of Banker's rounding is that numerical errors in floating point math tend to, on average, just come out in the wash. If you're doing scientific computing, you probably need to care, but the rest of us don't really.

Except.

Except that the IEEE rounding/truncation rules that turn out to be good for "better" consistency with floating point numbers don't actually match what the ANSI/ISO C specification requires for float to integer conversions. And much hilarity ensues because of it.

This is not to say that ANSI folks were off - quite the contrary, it makes good sense, and MOST of the time isn't a big deal. Unless you're writing graphics/audio/ray-tracing/processing code (i.e. multimedia) on a modern processor.

Then its a big deal.

So basically, the straight float to int cast is slow because it does a WHOLE lot of work to ensure that it is always correct (correctness, in this case being, "match the C Spec"). And this is unfortunately at odds with generally not propagating rounding errors during floating point arithmetic. The compiler cast is slow because you can be fast, or you can be right. Guess which one they chose?

Now, the obvious and easy question is this: "I'm writing [graphics][audio][ray-tracer][etc] code, I don't need to match the spec - can't we just do it fast?"

The answer, obviously, is "Yes."

And by "Yes", I mean that we can be both much faster, and correct more than enough of the time for our purposes. Remember, we're not doing scientific computing, although exactly understanding BOTH parts of the previous sentence is still important.

This is where most people default to "fistp" as the answer. Its the built-in x86 instruction for converting floating point numbers, so its seems like a pretty reasonable choice - pretty fast, if non-portable (even across compilers can be a pain), but it has a few gotchas. For one, its behaviour depends on the floating point mode. Now, you could just set the mode when your app starts, and use it willy-nilly whereever you need to cast quickly. The problem is that (1) you may have libraries, code, even third party DLLs that you're inadvertently changing the mode for - its set per thread, and (2) you're changing the math for any and all floating point math (multiply, divide, and normalization in particular).

Eeek.

You could try to wrap it locally, where you have hotspots, and that's not a bad option. Or more simply, you could just change your routines so that they work correctly with "rounded" conversions as opposed to straight truncation....

...Which is what you should do - I don't suggest using any of these conversions routines as global conversions; Just treat hotspots and adjust so that rounding works for you... And I really don't recommend setting the FPU round mode except by clear and extreme exception.

However.

However, its not the only option - and it doesn't solve what, for me, turns out to be a pretty core problem in multimedia programming: conversion directly to fixed point.

I'm not going to to get into the fun details of fixed point (if you're not sure what it is, then I take back the thing I said about putting myself in the "you" bucket), but suffice it say that it turns out to be a specific, big hotspot in accurate multimedia coding - as much, if not more so than, than just your conversions to integer.

Additionally, if you want to do other floating "conversions" (operations, really: ceil, floor, etc.) and their respective conversions to fixed (ceil to fixed, floor to fixed, etc.), then you'll end up doing a whole bunch of FPU mode settings, state management etc. (high complexity), or you're back pretty much where the compiler guys left you - 80 cycles to set the FPU round mode, flush the state, convert, and set it back.... yikes.

That's where the "magic number" conversion techniques come in.

Magic Numbers.

Poof.

Six quadrillion, seven hundred fifty five trillion, three hundred ninety nine billion, four hundred forty one million, fifty five thousand, seven hundred forty four (also known as 2 raised to the 52nd power multiplied by 1.5, or 6755399441055744.0). See why its magical? OK, maybe not so much...

The nitty gritty mechanisms of IEEE floats and magic numbers have been covered in detail by others, most clearly by Chris Hecker (IMHO), but the root idea of "magic number" conversions is not too complicated.

Basically, in order to add two floating point numbers, your processor "lines up" the decimal points of the numbers so that it can easily add the bits. It does this by "normalizing" the numbers such that the most significant bits are preserved, i.e. the smaller number "normalizes" to match the bigger one.

So the principle of the "magic number" conversion that xs_CRoundToInt() uses is this: We add a big enough floating point number (a number that is so big that there are significant digits only UP TO the decimal point, and none after it) to the one you're converting such that: (a) the number gets normalized by the processor to its integer equivalent and (b) adding the two does not erase the integral significat bits in the number you were trying to convert (i.e. XX00 + 00YY = XXYY).
And then, you can just "read" that integral component right out of the lower (32) bits into your integer. Since a double precision number has 52 bits of mantissa (plus 11 bits exponent, 1 bit sign), the magic number for doubles turns out to be 2 to the 52nd power (times 1.5, to deal with some sign arithmetic) - AKA our "magic number". As I said, RTFA because we're just getting warmed up.

Straight to Fixed.

Again, to cut to the chase, here are timings on my machine, and I can again confirm the results in production code - though in this case, the fistp to fixed conversion performs MUCH worse in the benchmark than in production because of pipelining issues (i.e. its slower in my renderer, too, but not quite as badly).

Performing 10000000 times:
simple cast convert 3186 ms i.e. fi = (f*65536); fistp convert 3031 ms i.e. fi = FISTToInt(f*65536); xs_ToFix 622 ms i.e. fi = xs_Fix<16>::ToFix(f);

After all, if you can use the FPU's normalization to get to an integer, you can use it just as easily to get to an integer shifted by any power of 2 - just EXACTLY what you need to convert to a fixed point number. So rather than using 2 to the 52nd power (times 1.5), if we want 16 bits of fixed point goodness, we'd use 2 to the 36th power (times 1.5), otherwise known as 2 to the (52-16)th power.

So its easy to convert any floating point number to any fixed point number.

And its fast and pipelines well.

But wait, there's more.

I promised a discussion of numerical accuracy as well. But at this point, that might have to be a follow-up.... at least in any great depth.

The routine xs_CRoundToInt() produces exactly the same results as fistp, i.e. it uses Banker's rounding. Be careful with as it can be subtle and, in fact, just recently bit me in my grid fitting code - you can actually see the bug in action if you resize windows in AIM Triton 1.0; they appear to sometimes lose a pixel at the bottom or right edge.

I should have used the routine xs_RountToInt(), which is (just about nearly) as fast, but uses "true" rounding, i.e. as in your high school math class, though I should point out that its not accurate to scientific precision. Still, it is MORE than good enough for even high precision ray tracing - last I had heard some of the exluna folks were still using it (they're actually the ones that prompted me to finally write this all up - although that was *cough* ... 3 years ago...).

The routine xs_ToInt() matches the C cast, but is more than twice as fast (its actually quite a bit better than 2x in the real world - but close enough), and doesn't require setting the FPU rounding mode. Ditto in terms of precisions as xs_RountToInt().

The xs_Fixed() point conversion routines and xs_FloorToInt() and xs_CeilToInt() are 5 to 10x faster than alternatives, and again Ditto (is that redundantly repeating?) for precision.

Final Thoughts.

Remember that xs_CRountToInt() and fistp produce EXACTLY the same results. The function xs_RoundToInt() is subtly different: it does arithmetic rounding instead of Banker's rounding. In all likelihood, that's probably what you actually thought you were doing with fistp, if you're doing graphics/multimedia (I know that's what I thought) and so I actually recommend that's the one you use everywhere.

Although most of this discussion was about comparisons to fistp (and just plain old C) on an x86, I've found the tricks just as valuable on Sparc powered boxes, 68k processors, etc. (especially the direct-to-fixed conversions on the latter).

On the PowerPC, however, it didn't really matter. It wasn't really slower, but it wasn't really faster, other than floor() and ceil()...

Below are some times from my machine (1.2 ghz Penitum IV) and Herf's machine (3.2ghz Pentium D), just for comparision, followed by code.

Enjoy - lemme know if you make it better.

Timings: Pentium IV/1.2
starting float conversion testing...
Testing ANSI cast() ... Time = 2772.65 ms
Testing SREE Int()  ... Time = 1235.65 ms

Rounding tests...
Testing BitConvert      ... Time = 1089.940 ms
Testing fistp           ... Time =  679.269 ms
Testing SREE RoundInt() ... Time =  677.468 ms
Testing BitConvert23    ... Time =  627.809 ms
Testing SREE CRoundInt()... Time =  622.519 ms

Fixed tests...
Testing ANSI fixed() ... Time = 2974.57 ms
Testing fistp fixed()... Time = 3100.84 ms
Testing Sree fixed() ... Time =  606.80 ms

Floor tests...   
Testing ANSI floor()... Time = 7719.400 ms   
Testing SREE Floor()... Time =  687.177 ms


Timings: Pentium D/3.2
starting float conversion testing...
Testing ANSI cast() ... Time = 1948.470 ms
Testing SREE Int()  ... Time =  933.663 ms

Rounding tests...
Testing BitConvert      ... Time = 612.314 ms
Testing fistp           ... Time = 523.792 ms
Testing SREE RoundInt() ... Time = 378.007 ms
Testing BitConvert23    ... Time = 338.638 ms
Testing SREE CRoundInt()... Time = 333.033 ms

Fixed tests...
Testing ANSI fixed() ... Time = 2163.56 ms
Testing fistp fixed()... Time = 7103.66 ms
Testing Sree fixed() ... Time =  335.03 ms

Floor tests...
Testing ANSI floor()... Time = 3771.55 ms
Testing SREE Floor()... Time = 380.25 ms

Source Code:

xs_fixfloat.zip

// ====================================================================================================================
// ====================================================================================================================
//  xs_Float.h
// ====================================================================================================================
// ====================================================================================================================
#ifndef _xs_FLOAT_H_
#define _xs_FLOAT_H_

 
#include "xs_Config.h"

 
// ====================================================================================================================
//  Defines
// ====================================================================================================================
#ifndef _xs_DEFAULT_CONVERSION
#define _xs_DEFAULT_CONVERSION      0
#endif //_xs_DEFAULT_CONVERSION

 
#if _xs_BigEndian_
       #define _xs_iexp_                        0
       #define _xs_iman_                        1
#else
       #define _xs_iexp_                        1       //intel is little endian
       #define _xs_iman_                        0
#endif //BigEndian_

 
#define _xs_doublecopysgn(a,b)      ((int32*)&a)[_xs_iexp_]&=~(((int32*)&b)[_xs_iexp_]&0x80000000) 
#define _xs_doubleisnegative(a)     ((((int32*)&a)[_xs_iexp_])|0x80000000) 

 
// ====================================================================================================================
//  Constants
// ====================================================================================================================
const real64 _xs_doublemagic             = real64 (6755399441055744.0);      //2^52 * 1.5,  uses limited precisicion to floor
const real64 _xs_doublemagicdelta        = (1.5e-8);                         //almost .5f = .5f + 1e^(number of exp bit)
const real64 _xs_doublemagicroundeps     = (.5f-_xs_doublemagicdelta);       //almost .5f = .5f - 1e^(number of exp bit)

 
// ====================================================================================================================
//  Prototypes
// ====================================================================================================================
int32 xs_CRoundToInt      (real64 val, real64 dmr =  _xs_doublemagic);
int32 xs_ToInt            (real64 val, real64 dme = -_xs_doublemagicroundeps);
int32 xs_FloorToInt       (real64 val, real64 dme =  _xs_doublemagicroundeps);
int32 xs_CeilToInt        (real64 val, real64 dme =  _xs_doublemagicroundeps);
int32 xs_RoundToInt       (real64 val);

 
//int32 versions	 	- just to make macros and templates easier
int32 xs_CRoundToInt      (int32 val)   {return val;}
int32 xs_ToInt            (int32 val)   {return val;}

 


// ====================================================================================================================
//  Fix Class
// ====================================================================================================================
template <int32 N> class xs_Fix
{
public:
    typedef int32 Fix;

 
    // ====================================================================================================================
    //  Basic Conversion from Numbers
    // ====================================================================================================================
    finline static Fix       ToFix       (int32 val)    {return val<<N;}
    finline static Fix       ToFix       (real64 val)   {return xs_ConvertToFixed(val);}

 
    // ====================================================================================================================
    //  Basic Conversion to Numbers
    // ====================================================================================================================
    finline static real64    ToReal      (Fix f)        {return real64(f)/real64(1<<N);}
    finline static int32     ToInt       (Fix f)        {return f>>N;}

 
protected:
    // ====================================================================================================================
    // Helper function - mainly to preserve _xs_DEFAULT_CONVERSION
    // ====================================================================================================================
    finline static int32 xs_ConvertToFixed (real64 val)
    {
    #if _xs_DEFAULT_CONVERSION==0
        return xs_CRoundToInt(val, _xs_doublemagic/(1<<N));
    #else
        return (long)((val)*(1<<N));
    #endif
    }
};





 
// ====================================================================================================================
// ====================================================================================================================
//  Inline implementation
// ====================================================================================================================
// ====================================================================================================================
finline int32 xs_CRoundToInt(real64 val, real64 dmr)
{
#if _xs_DEFAULT_CONVERSION==0
    val              = val + dmr;
    return ((int32*)&val)[_xs_iman_];
#else
    return int32(floor(val+.5));
#endif
}

// ====================================================================================================================
finline int32 xs_ToInt(real64 val, real64 dme)
{
    /* unused - something else I tried...
            _xs_doublecopysgn(dme,val);
            return xs_CRoundToInt(val+dme);
    */

#if _xs_DEFAULT_CONVERSION==0
       return (val<0) ?   xs_CRoundToInt(val-dme) : xs_CRoundToInt(val+dme);
#else
    return int32(val);
#endif
}

// ====================================================================================================================
finline int32 xs_FloorToInt(real64 val, real64 dme)
{
#if _xs_DEFAULT_CONVERSION==0
    return xs_CRoundToInt (val - dme);
#else
    return floor(val);
#endif
}

// ====================================================================================================================
finline int32 xs_CeilToInt(real64 val, real64 dme)
{
#if _xs_DEFAULT_CONVERSION==0
    return xs_CRoundToInt (val + dme);
#else
    return ceil(val);
#endif
}

// ====================================================================================================================
finline int32 xs_RoundToInt(real64 val)
{
#if _xs_DEFAULT_CONVERSION==0
    return xs_CRoundToInt (val + _xs_doublemagicdelta);
#else
    return floor(val+.5);
#endif
}


 
// ====================================================================================================================
// ====================================================================================================================
#endif // _xs_FLOAT_H_