Thursday, April 7, 2011

Floating Point Math

The old Apollo Guidance Computer (AGC), used back in the Apollo program to land the Lunar Module, ran at a blazing 80kHz and was able to run a Kalman filter. It had no floating point support at all, so it used extensive fixed point math, plus an interpreter for things like matrix multiplication. I instead use an LPC2148, which runs at a mere 60MHz, or about 750 times faster. I should be able to run a guidance program with that.

For a long time, nothing I wrote on the Logomatic used floating point numbers. There was no need - everything was integers and addresses, integers and addresses all day long. Then I translated the Kalman filter to it, and everything changed. I figured I would need some fixed point math, and all the bookkeeping headaches that go with that. However, as a first attempt, I just used float to see what would happen. What happened was that my processor utilization climbed from 30% just reading and recording the sensors to about 50% with the gyro filter in place. This might work.

I designed my own matrix code, with just enough functionality to do the filter. No matrix inverse, especially. No in-place multiplication, so lots of scratch space needed. Inline code where I thought it could help. Operations which handle cases where I want to do a matrix times a transpose of a matrix, and operations where the result is known to be scalar.
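
To give a flavor of what routines like these can look like, here is a minimal sketch in C. It is not the actual code from this project: the names mat_mul_abt and quad_form, the row-major layout, and the argument order are all assumptions made for illustration. The first routine multiplies a matrix by the transpose of another without ever forming the transpose, and writes into separate scratch space. The second computes a product that is known to be scalar.

```c
#include <stddef.h>

/* C = A * B^T without forming B^T.
 * A is rows x inner, B is cols x inner, C is rows x cols, all row-major.
 * Not in-place: c must not overlap a or b. (Hypothetical helper.) */
void mat_mul_abt(const float *a, const float *b, float *c,
                 size_t rows, size_t inner, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            float sum = 0.0f;
            /* Row i of A dotted with row j of B is element (i,j) of A*B^T */
            for (size_t k = 0; k < inner; k++) {
                sum += a[i * inner + k] * b[j * inner + k];
            }
            c[i * cols + j] = sum;
        }
    }
}

/* Scalar result h * P * h^T, for a 1 x n row vector h and an
 * n x n matrix P. (Hypothetical helper.) */
float quad_form(const float *h, const float *p, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float row = 0.0f;
        for (size_t j = 0; j < n; j++) {
            row += p[i * n + j] * h[j];
        }
        total += h[i] * row;
    }
    return total;
}
```

When each measurement is a single scalar, the quantity that would otherwise need a matrix inverse is itself a scalar, so the inverse collapses to one floating point divide. That is presumably the kind of case where a scalar-result routine pays off.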

So, I now have gobs of floating point. In the gyro routine alone, I have a temperature offset corrector (three polynomials, one for each axis). I have the filter itself, all running full-bore floating point matrix math.
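
As an illustration of the temperature offset corrector, here is a minimal sketch that evaluates one polynomial per axis with Horner's rule and subtracts the result from the raw reading. The polynomial order, the coefficient layout (lowest order first), and the names poly_eval and gyro_correct are assumptions, not the real calibration code.

```c
#include <stddef.h>

#define GYRO_AXES 3
#define N_COEF    3   /* assumed: quadratic fit, coefficients c0, c1, c2 */

/* Evaluate c0 + c1*x + c2*x^2 + ... using Horner's rule. */
float poly_eval(const float *coef, size_t n_coef, float x)
{
    float y = 0.0f;
    for (size_t i = n_coef; i-- > 0; ) {
        y = y * x + coef[i];
    }
    return y;
}

/* Subtract a temperature-dependent offset from each gyro axis.
 * coef[axis] holds that axis's polynomial, lowest order first. */
void gyro_correct(const float raw[GYRO_AXES], float temp,
                  const float (*coef)[N_COEF], float out[GYRO_AXES])
{
    for (size_t axis = 0; axis < GYRO_AXES; axis++) {
        out[axis] = raw[axis] - poly_eval(coef[axis], N_COEF, temp);
    }
}
```

Horner's rule needs only one multiply and one add per coefficient, which matters when every floating point operation is a software library call.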

Then I come across this. These are the timings for a "fast" math library (GoFast in the table below) that apparently handles transcendentals well, but is slower than GCC's own routines at just adding and multiplying. The benchmarks are for a 48MHz ARM7TDMI, the AT91SAM7. All timings are in microseconds, so smaller is better.

ARM7: AT91SAM7X256-EK, 48 MHz, Internal Flash, GCC v4.4.1
               Double Precision     Single Precision
Function        GoFast      GNU      GoFast      GNU
add              4.822    3.806       3.806    2.659
subtract         5.074    3.799       3.779    2.814
multiply         4.674    3.334       3.008    2.057
divide          32.438   22.356      16.650    5.725
sqrt            63.384   50.835      33.136   17.603
cmp              2.843    1.821       2.152    1.533
fp to long       1.949    1.418       1.528    1.294
fp to ulong      1.892    1.184       1.470    1.090
long to fp       2.725    2.742       2.454    2.188
ulong to fp      2.329    2.704       1.941    2.264

I can expect a 60MHz core of the same design to go 25% faster, so each operation should take about 80% of the time shown here. First, the "fast" math library is slower than GCC for the operations that the Kalman filter actually uses. Second, sqrt doesn't take very long either. Third, multiply takes a lot less time than divide, especially for double precision. Fourth, with GCC, double precision takes about 30% longer on average, except for divide and sqrt, which take roughly four and three times as long.
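
The clock scaling and a rough cost estimate can be sketched in a few lines of C. This is only a back-of-the-envelope model. It assumes the cycle count per operation is the same at 48MHz and 60MHz (flash wait states could break that), and it ignores loop overhead and memory access entirely. The function names are made up for this example.

```c
/* Scale a timing measured at one clock rate to another clock rate,
 * assuming the operation takes the same number of cycles on both. */
double scale_time_us(double t_us, double f_measured_mhz, double f_target_mhz)
{
    return t_us * f_measured_mhz / f_target_mhz;
}

/* Arithmetic-only time for a naive n x n by n x n matrix multiply:
 * n^3 multiplies and n^2 * (n - 1) adds. */
double matmul_time_us(unsigned n, double t_mul_us, double t_add_us)
{
    double muls = (double)n * n * n;
    double adds = (double)n * n * (n - 1);
    return muls * t_mul_us + adds * t_add_us;
}
```

Plugging in the GNU single precision numbers, a multiply scales from 2.057 to about 1.65 microseconds and an add from 2.659 to about 2.13 microseconds. A 7x7 by 7x7 multiply is 343 multiplies and 294 adds, which comes to roughly 1.2 milliseconds of pure arithmetic at 60MHz.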

Fifth and most important, double precision is an option if I feel like I need it. This is the first project I've worked on where I explicitly chose float as my own design decision, rather than to maintain compatibility with another piece of code. On any desktop built since the 486, floating point hardware has been standard, and all math is done at greater than double precision once it is on the x87, so you pay only in space, not time, for using double precision.

With a 7-element state vector for orientation only, the scratch space alone for the Kalman filter takes 1344 bytes in single precision. No extra data space is needed for more observations, just code. Cranking the state vector up to 13 (6 for translation and 7 for orientation) and switching to double precision uses up 8736 bytes, a little over 1/4 of the 32kiB available.
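
Both byte counts are reproduced by a simple model: six n x n matrices plus six n-element vectors of scratch. That breakdown is inferred from the two numbers, not a listing of the actual buffers, and the function name is made up for this sketch.

```c
#include <stddef.h>

/* Scratch bytes for an n-element state, ASSUMING six n x n matrices
 * and six n-vectors. This model is inferred from the quoted totals. */
size_t kalman_scratch_bytes(size_t n, size_t elem_size)
{
    return (6 * n * n + 6 * n) * elem_size;
}
```

With n = 7 and 4-byte floats this gives 336 elements, or 1344 bytes. With n = 13 and 8-byte doubles it gives 1092 elements, or 8736 bytes. The n squared term dominates, so nearly doubling the state size roughly triples the element count before the doubled element size is even counted.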

Have I mentioned how much I love the LPC today? I am at 187 of 500kiB of flash and 19 of 40kiB of RAM (32kiB easily usable), so I am still using under 50% of the available space.