For a long time, nothing I wrote on the Logomatic used floating point numbers. There was no need: everything was integers and addresses, integers and addresses all day long. Then I translated the Kalman filter to it, and everything changed. I figured I would need some fixed-point math, and all the bookkeeping headaches that go with it. However, as a first attempt, I just used float to see what would happen. What happened was that my processor utilization climbed from 30% (just reading and recording the sensors) to about 50% with the gyro filter in place. This might work.
I designed my own matrix code, with just enough functionality to do the filter. No matrix inverse, especially. No in-place multiplication, so lots of scratch space is needed. Inline code where I thought it could help. Operations which handle the case where I want to multiply a matrix by the transpose of a matrix, and operations where the result is known to be scalar.
So, I now have gobs of floating point. In the gyro routine alone, I have a temperature offset corrector (three polynomials, one for each axis). I have the filter itself, all running full-bore floating point matrix math.
Then I come across this: timings for a "fast" math library that apparently handles transcendentals well, but is worse than GCC at plain adding and multiplying. These benchmarks are for a 48 MHz ARM7TDMI, the AT91SAM7. All timings are in microseconds, so smaller is better.
ARM7: AT91SAM7X256EK, 48 MHz, Internal Flash, GCC v4.4.1

                 Double Precision        Single Precision
Function         GoFast       GNU        GoFast       GNU
add               4.822      3.806        3.806      2.659
subtract          5.074      3.799        3.779      2.814
multiply          4.674      3.334        3.008      2.057
divide           32.438     22.356       16.650      5.725
sqrt             63.384     50.835       33.136     17.603
cmp               2.843      1.821        2.152      1.533
fp to long        1.949      1.418        1.528      1.294
fp to ulong       1.892      1.184        1.470      1.090
long to fp        2.725      2.742        2.454      2.188
ulong to fp       2.329      2.704        1.941      2.264
I can expect a 60 MHz core of the same design to go 25% faster (take 80% of the time shown here). First, the "fast" math library is slower than GCC for the operations that the Kalman filter actually uses. Second, even sqrt doesn't take very long. Third, multiply takes a lot less time than divide, especially in double precision. Fourth, double precision takes about 30% longer on average, except for divide.
Fifth and most important, double precision is an option if I feel like I need it. This is the first project I've worked on where I explicitly chose float as my own design decision, rather than to maintain compatibility with another piece of code. On any desktop built since the 486, floating point hardware has been standard, and once a value is on the x87 all math is done at greater than double precision, so you pay only in space, not time, for using double precision.
With a state vector of 7 elements for orientation only, the scratch space alone for the Kalman filter takes 1344 bytes. No extra data space is needed for more observations, just code. Cranking the state vector up to 13 (6 for translation plus 7 for orientation) and switching to double precision uses 8736 bytes, more than 1/4 of the 32 kiB available.
Have I mentioned how much I love the LPC today? I am at 187 of 500 kiB of flash and 19 of 40 kiB of RAM (32 kiB of it easily usable), so I am still using under 50% of the available space.