## Friday, April 15, 2011

### Efficiency of the Kalman Filter

So as I said before, my single-loop design doesn't have any hard scheduling limits or 1201 sensitivity like the Apollo computers. However, since my filter is run so much more frequently, it is more important for it to be quick.

So, at first, I just did a textbook implementation of the filter, with no particular optimization. I follow the twin laws of optimization:

Law 1: Don't do it.
Law 2 (for advanced programmers only!): Don't do it yet.

This worked acceptably well with just the 7-element orientation filter. It seemed to run in about 15ms, sometimes hitting its 10ms target and sometimes missing it. There was still about 45% idle time, suggesting that it was really close to 10ms.

But, features gradually accumulated, especially a whole spiderweb of pointers in an attempt to use object-oriented features in a non-object-oriented language. Also, I actually added some control code. This started raising the average time to closer to 20ms.

Finally, I added the position Kalman filter. Without even updating the acceleration sensors at all, the time used jumped to well over 100ms, clearly excessive.

So, I have done an operation count analysis of the filter. Since the filter depends so heavily on matrix multiplication, and since matrix multiplication is an O(m^3) task, we would expect that doubling the size of the state vector would multiply the time by 8, and this seems to be the case. However, 55% of the time in the un-optimized case, and a staggering 81% of the time in a slightly optimized case, is spent on the time update of the covariance. Here's the first form, with just the orientation state vector:

| operation | mul | add | div | % of mul | % of add | % of div | time usec | % of time |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `A=1` | 0 | 0 | | 0.00% | 0.00% | 0.00% | 0 | 0.00% |
| `A+=A*Phi*dt` | 392 | 343 | | 30.27% | 30.22% | 0.00% | 1374.705 | 30.21% |
| `x+=F*dt` | 7 | 7 | | 0.54% | 0.62% | 0.00% | 26.4096 | 0.58% |
| `P=APA'+Q` | 686 | 637 | | 52.97% | 56.12% | 0.00% | 2483.908 | 54.59% |
| `H=H` | 0 | 0 | | 0.00% | 0.00% | 0.00% | 0 | 0.00% |
| `HP=H*P` | 49 | 42 | | 3.78% | 3.70% | 0.00% | 169.9768 | 3.74% |
| `Gamma=HP*H'+R` | 7 | 7 | | 0.54% | 0.62% | 0.00% | 26.4096 | 0.58% |
| `K=(P/Gamma)H'` | 98 | 42 | 1 | 7.57% | 3.70% | 100.00% | 255.1912 | 5.61% |
| `Z=z-g` | 0 | 1 | | 0.00% | 0.09% | 0.00% | 2.1272 | 0.05% |
| `X=KZ` | 7 | 0 | | 0.54% | 0.00% | 0.00% | 11.5192 | 0.25% |
| `x=x+X` | 0 | 7 | | 0.00% | 0.62% | 0.00% | 14.8904 | 0.33% |
| `P=P-KHP` | 49 | 49 | | 3.78% | 4.32% | 0.00% | 184.8672 | 4.06% |
| total | 1295 | 1135 | 1 | 100.00% | 100.00% | 100.00% | 4550.004 | 100.00% |
| time cost | 2.057 | 2.659 | 5.725 | | | | | |
| MHz factor | 1.6456 | 2.1272 | 4.58 | | | | | |
| time usec | 2131.052 | 2414.372 | 4.58 | | | | 4550.004 | |

There are a couple of blazingly obvious and not-quite-so-obvious optimizations we can do here. First, in A+=A*Phi*dt, I am multiplying by an identity matrix. That's kinda silly, but it still costs m^3 operations. So, we optimize that bit.
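That optimization looks something like this. This is a minimal sketch in C, not the flight code; the names `form_A` and `Phi`, and the fixed size `N`, are illustrative. Since A is known to be the identity going in, A = I + Phi*dt can be formed element by element in O(N^2) work rather than spending an O(N^3) matrix multiply on it:

```c
#define N 7  /* orientation state vector size, per the post */

/* Form A = I + Phi*dt directly, element by element, instead of
   multiplying an identity matrix A by Phi*dt: N^2 multiplies and
   N adds on the diagonal, versus N^3 operations for the multiply. */
void form_A(double A[N][N], double Phi[N][N], double dt) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i][j] = Phi[i][j] * dt + (i == j ? 1.0 : 0.0);
        }
    }
}
```

This matches the second table's row `A=Phi*dt+diag(1)`: 49 multiplies and 7 adds for a 7-element state.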

Secondly, something has been bugging me for a while and I finally solved it. For Gamma, we need to calculate HPH'. Now we use HP as an intermediate result later, but we also use PH', and it bugged me to calculate both. Finally, I worked it out. If I keep HP, I am left with K=PH'/Gamma. But since P is symmetric, HP=HP'=(PH')', so we can say that K'=(PH')'/Gamma=HP/Gamma. All we need to do is transpose HP and copy it to the storage for K, multiplying each element by 1/Gamma as we do so.
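As a sketch, assuming a scalar measurement (so H is 1×N, HP is a row vector, and Gamma is a scalar, consistent with the single divide in the tables), the transpose-and-scale looks like this; `kalman_gain` is an illustrative name, not the flight code:

```c
#define N 7  /* orientation state vector size, per the post */

/* Kalman gain without recomputing PH': since P is symmetric,
   PH' = (HP)', so K = (HP)'/Gamma.  One divide to form 1/Gamma,
   then N multiplies while copy-"transposing" the row HP into the
   column K -- the 7 mul, 1 div line in the optimized table. */
void kalman_gain(double K[N], double HP[N], double Gamma) {
    double inv_Gamma = 1.0 / Gamma;  /* the single divide */
    for (int i = 0; i < N; i++) {
        K[i] = HP[i] * inv_Gamma;    /* scale while copying */
    }
}
```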

This brings us to here:

| operation | mul | add | div | % of mul | % of add | % of div | time usec | % of time |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `A=Phi*dt+diag(1)` | 49 | 7 | | 5.69% | 0.98% | 0.00% | 95.5248 | 3.25% |
| `x+=F*dt` | 7 | 7 | | 0.81% | 0.98% | 0.00% | 26.4096 | 0.90% |
| `P=APA'+diag(Q)` | 686 | 595 | | 79.67% | 83.22% | 0.00% | 2394.566 | 81.38% |
| `H=H` | 0 | 0 | | 0.00% | 0.00% | 0.00% | 0 | 0.00% |
| `HP=H*P` | 49 | 42 | | 5.69% | 5.87% | 0.00% | 169.9768 | 5.78% |
| `Gamma=HP*H'+R` | 7 | 7 | | 0.81% | 0.98% | 0.00% | 26.4096 | 0.90% |
| `K=(HP)'/Gamma` | 7 | 0 | 1 | 0.81% | 0.00% | 100.00% | 16.0992 | 0.55% |
| `Z=z-g` | 0 | 1 | | 0.00% | 0.14% | 0.00% | 2.1272 | 0.07% |
| `X=KZ` | 7 | 0 | | 0.81% | 0.00% | 0.00% | 11.5192 | 0.39% |
| `x=x+X` | 0 | 7 | | 0.00% | 0.98% | 0.00% | 14.8904 | 0.51% |
| `P=P-KHP` | 49 | 49 | | 5.69% | 6.85% | 0.00% | 184.8672 | 6.28% |
| total | 861 | 715 | 1 | 100.00% | 100.00% | 100.00% | 2942.39 | 100.00% |
| time cost | 2.057 | 2.659 | 5.725 | | | | | |
| MHz factor | 1.6456 | 2.1272 | 4.58 | | | | | |
| time usec | 1416.862 | 1520.948 | 4.58 | | | | 2942.39 | |

This represents a 35% improvement already. Woohoo.

That time update of covariance is still a problem. What if we just don't do it? You may think I am joking, but this can actually be done in some problems, and is called offline gain calculation. In some problems, the value of P converges on a final answer. When it does so, and if H is also constant, K also converges. What is done in this case is to calculate the final values of P and K from the problem ahead of time, and not compute them in the loop at all.

Unfortunately, this is usually only possible in a linear problem, which I don't think the IMU filter is. I'm certainly not treating it as linear. But hopefully P changes slowly enough that we don't need to do a full update every time. We accumulate A over several time updates, and only use it to update P once in a while. We are already skipping the time update when the time difference is zero, so this 80% of the time is not spent as often as one might think. This change just does the expensive half of the time update even less often.
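The accumulate-and-defer idea might look like the sketch below. Everything here is an assumption for illustration, not the flight code: the names, the `DECIMATE` batch size, and the simplification of adding Q once per batch instead of once per step (a further approximation that only makes sense if P changes slowly, as hoped above). Each call still pays one matrix multiply to fold the new A into the accumulator, but the two multiplies of APA' happen only every DECIMATE-th call:

```c
#include <string.h>

#define N 7        /* state vector size, per the post */
#define DECIMATE 4 /* hypothetical: full P update every 4th time update */

static double Acc[N][N];  /* accumulated transition matrix A_k*...*A_1 */
static int steps_since_P; /* time updates since the last P update */

/* C = A*B, plain O(N^3) matrix multiply */
static void mat_mul(double C[N][N], double A[N][N], double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++) s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

/* Deferred covariance time update: fold this step's A into Acc every
   call, but apply the expensive P = Acc*P*Acc' + diag(Q) only every
   DECIMATE-th call.  Q is added once per batch -- an approximation. */
void time_update_P(double P[N][N], double A[N][N], double Q[N]) {
    double tmp[N][N];
    if (steps_since_P == 0) {
        memcpy(Acc, A, sizeof Acc);          /* start a new batch */
    } else {
        mat_mul(tmp, A, Acc);                /* Acc = A*Acc */
        memcpy(Acc, tmp, sizeof Acc);
    }
    if (++steps_since_P < DECIMATE) return;  /* defer the expensive half */
    double AP[N][N];
    mat_mul(AP, Acc, P);                     /* AP = Acc*P */
    for (int i = 0; i < N; i++)              /* P = AP*Acc' + diag(Q) */
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++) s += AP[i][k] * Acc[j][k];
            P[i][j] = s + (i == j ? Q[i] : 0.0);
        }
    steps_since_P = 0;                       /* reset for the next batch */
}
```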

Also, it seems like there must be a way to optimize the APA' operation. It just has so much symmetry. Maybe there is a way to calculate only one triangle of this matrix.
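There is: since P is symmetric, APA' is symmetric too, so the second product only needs its upper triangle, with the lower triangle mirrored from it. That cuts the second N^3 multiply roughly in half (the first product, AP, is not symmetric in general and still needs to be computed in full). A sketch, with illustrative names:

```c
#define N 7  /* state vector size, per the post */

/* Covariance time update P = A*P*A' + diag(Q), exploiting symmetry:
   because P is symmetric, the result is symmetric, so only the upper
   triangle of the second product is computed and then mirrored. */
void cov_time_update(double P[N][N], double A[N][N], double Q[N]) {
    double AP[N][N];
    for (int i = 0; i < N; i++)       /* AP = A*P: full product */
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++) s += A[i][k] * P[k][j];
            AP[i][j] = s;
        }
    for (int i = 0; i < N; i++)       /* P = AP*A' + diag(Q) */
        for (int j = i; j < N; j++) { /* upper triangle only... */
            double s = 0.0;
            for (int k = 0; k < N; k++) s += AP[i][k] * A[j][k];
            P[i][j] = s + (i == j ? Q[i] : 0.0);
            P[j][i] = P[i][j];        /* ...then mirror */
        }
}
```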