# Floating-point optimisation for small devices, part 1

Ok, I will begin today a new serie of articles dealing with one specific field of optimisation: floating point calculus.

I am working as a software engineer in the mobile field. My daily work involves coding efficient algorithms on small devices -primarily telephones and handhelds based on ARM cpus. Coding for those machines is very rewarding! I am an old school guy, and it reminds me the glorious days of the Amiga. Most ARM-based devices still don't have any FPU, even if most of them now run at slightly faster frequencies than my old Amiga 1200 used to.

Early optimisation is the root of all Evil

, as Michael Abrash used to say. However, when the project you're working on stops being a prototype, there are some days you stop being an *engineer* as well and you get back to your *hacker* roots. I cherish those days. I like when my code runs fast. I like it even more when it's because I've found some magic trick to make it run faster. Sometimes I find this magic on Google, sometimes I create it. Both can be pleasant in their own way.

You won't find any assembly code in this serie of articles. I truly believe that the days where hand coded assembly could be faster than compiler generated assembly are gone, and I won't regard anyone who'd tell me how wrong I am as serious. Considering the complexity of modern processors, I would always trust my compiler far more than my poor tired brain :-)

So let's begin with some simple bits. More complex things will come later...

## Representing IEEE-754 single precision floating point values

IEEE-754 defines a 32-bit format for representing floating point values. A very good explanation about the standard can be found at Wikipedia, so I won't write one here.

What we ideally want is to perform floating point calculus by using integer math only. A single precision IEEE-754 `float`

number uses 32 bits, just like an `int`

or an `unsigned int`

does. So let's define a type we will use in many functions from now:

/// @brief The same 32-bit value accessed in 3 different representations. typedef union { float fval; unsigned int uval; int ival; } float_bits_t;

## Explicit and implicit float values comparison

If you've read the aforementionned Wikipedia article, you know that a float's sign bit is located at bit #31, exactly where it is when you're dealing with integer values. So here are the first two methods I will give you:

/** @brief Tells if a number is negative, or not. * @param x Some floating-point value. * @result true if x is negative, false otherwise. */ inline bool IsNegative(float x) { float_bits_t fb; fb.fval = x; return (fb.uval & 0x80000000)!=0; } /** @brief Returns the absolute value of the given argument. * @param x Some floating-point value. * @result Absolute value of x. */ inline float AbsoluteValue(float x) { float_bits_t fb; fb.fval = x; fb.uval = fb.uval & 0x7fffffff; return fb.fval; }

Simple. The `IsNegative()`

method gives us a quick way to tell weither a number is negative or not, without comparing its value with `0.0f`

. Speaking of comparing two values, we can re-use it in the following methods:

/** @brief Tells if one value is (strictly) greater than another. * @param a Some floating-point value. * @param b Some floating-point value. * @result true if a is greater than b, false otherwise. */ inline bool IsGreaterThan(float a, float b) { return IsNegative(b-a); } /** @brief Returns the minimum of two values. * @param a Some floating-point value. * @param b Some floating-point value. * @result b if a is greater than a, a otherwise. */ inline float Minimum(float a, float b) { return IsGreaterThan(a,b)?b:a; } /** @brief Returns the maximum of two values. * @param a Some floating-point value. * @param b Some floating-point value. * @result a if a is greater than a, b otherwise. */ inline float Maximum(float a, float b) { return IsGreaterThan(a,b)?a:b; }

We're almost done with floating point values comparison. The only lacking method is `IsEqualTo()`

, i.e. the one which tests equality. One trivial implementation could be:

/** @brief First implementation of IsEqualTo() * @param a Some floating-point value. * @param b Some floating-point value. * @result true is a and b ara equal to each other, false otherwise. */ inline bool IsEqualTo(float a, float b) { return AbsoluteValue(a-b)<=Epsilon; }

However, this method makes use of a floating point substract operation, and we'd like to stick with integer arithmetics. Second try, using integers:

/** @brief Second implementation of IsEqualTo() * @param a Some floating-point value. * @param b Some floating-point value. * @result true is a and b are equal to each other, false otherwise. */ inline bool IsEqualTo(float a, float b) { float_bits_t fba; fba.fval = a; float_bits_t fbb; fbb.fval = b; return (fba.uval==fbb.uval); }

This one seems good, except it cannot deal with zeroes. In IEEE-754 standard, zeroes can be either positive or negatives. Look at the following chart:

Value | Hexadecimal | Decimal |
---|---|---|

+2.8025969e-045 | 0x00000002 | 2 |

+1.4012985e-045 | 0x00000001 | 1 |

+0.00000000 | 0x00000000 | 0 |

-0.00000000 | 0x80000000 | -2147483648 |

-1.4012985e-045 | 0x80000001 | -2147483647 |

-2.8025969e-045 | 0x80000002 | -2147483646 |

So, this second `IsEqualTo()`

function will return false when called with two zero values with different signs. The usual way to deal with this kind of problem arising when manipulating negative values is to adjust them so that they are lexicographically ordered as twos-complement integers instead of as signed magnitude integers. This is done by detecting negative numbers and subtracting them from 0x80000000. This maps negative zero to an integer zero representation – making it identical to positive zero – and it makes it so that the smallest negative number is represented by negative one, and downwards from there. With this change the representations of our numbers around zero look much more rational:

Value | Hexadecimal | Decimal |
---|---|---|

+2.8025969e-045 | 0x00000002 | 2 |

+1.4012985e-045 | 0x00000001 | 1 |

+0.00000000 | 0x00000000 | 0 |

-0.00000000 | 0x00000000 | 0 |

-1.4012985e-045 | 0xFFFFFFFF | -1 |

-2.8025969e-045 | 0xFFFFFFFE | -2 |

/** @brief Third implementation of IsEqualTo() * @param a Some floating-point value. * @param b Some floating-point value. * @result true is a and b are equal to each other, false otherwise. */ inline bool IsEqualTo(float a, float b) { float_bits_t fba; fba.fval = a; float_bits_t fbb; fbb.fval = b; // Make fba.ival lexicographically ordered as a twos-complement int if (fba.ival < 0) fba.ival = 0x80000000 - fba.ival; // Make fbb.ival lexicographically ordered as a twos-complement int if (fbb.ival < 0) fbb.ival = 0x80000000 - fbb.ival; // Check equality return (fba.ival == fbb.ival) }

This is better! However, we have introduced a lot of tests! (yes, *two* tests is *a lot* as it might break code branch prediction) Let's remove those tests using masks and the sign bit (here comes the magic):

/** @brief Tests if two values are equal to each other. * @param a Some floating-point value. * @param b Some floating-point value. * @result true is a and b are equal to each other, false otherwise. */ inline bool IsEqualTo(float a, float b) { float_bits_t fba; fba.fval = a; float_bits_t fbb; fbb.fval = b; // If the two values have different signs , the following line will // result in 0x00000000. // Otherwise (same sign) it will result in 0xffffffff. int tmp = (int)((fba.uval^fbb.uval)>>31) - 1; // If signs are different, then tmp == 0 and the following will // equivalent to: // tmp = (0x80000000 - fba.ival) - fbb.ival; // Else there is no need for two-complement reordering, so it // will be equivalent to: // tmp = fba.ival - fbb.ival; tmp = (((0x80000000 - fba.ival)&(~tmp)) | (fba.ival&tmp)) - fbb.ival; // tmp now holds the delta between the two (maybe) re-ordered values, // so we can check for equality: return (tmp==0); }

The first implementation of `IsGreaterThan(a,b)`

that I presented above calls `IsNegative(b-a)`

, thus making use of a floating point substract too. We can rewrite it using the same trick:

/** @brief Tells if one value is (strictly) greater than another. * @param a Some floating-point value. * @param b Some floating-point value. * @result true if a is greater than b, false otherwise. */ inline bool IsGreaterThan(float a, float b) { float_bits_t fba, fbb; fba.fval = a; fbb.fval = b; int ta = (int)(fba.uval>>31)-1; int tb = (int)(fbb.uval>>31)-1; return (int)(((0x80000000 - fba.ival)&(~ta)) | (fba.ival&ta)) > \ (int)(((0x80000000 - fbb.ival)&(~tb)) | (fbb.ival&tb)); }

## Concluding remarks

That's enough for today. In this conclusion, I would like to state some rules I think you should follow when targeting small devices:

- Branches are bad. Try to use arithmetics instead of
`if()`

clauses. Using`result=test?value1:value2;`

is better as it tranlates easily to i686's`CMOV`

opcode or ARM's opcode conditional field, but plain integer arithmetic is often the best choice. - Still, don't forget ARM processors do not have any integer division operation!
- Do not believe me when I say
This is the fastest way to...

. There is always a faster way! If it has yet to be found, please be the first one to discover it and report it back to me :-)

I hope you enjoyed reading this article. In the next one I will show you how to test IEEE-754 special values like infinity and NaNs, how to check if a value can be safely used as a denominator, and how to compute square roots and reciprocals incredibly faster than you used to.

## Leave a Comment