Monday, June 24, 2013

Bugs crossing Boundaries

I am getting the Rocketometer ready for delivery. The hardware appears to be fine, but the software had one particularly weird bug -- the name of the FAT volume on the SD card was changing. This meant that something was poking into the root directory block where it did not belong, which meant that it could be anything.

As it turned out, it was the packet buffer becoming full. Now how in the world did that corrupt the file system buffers? As it turned out, the file system by design wasn't using buffers, in order to save memory.

An SD card is addressed a sector at a time. Data can only be read and written in full sectors. So, if you want to change a single entry in the root directory or file allocation table, you have to read the whole sector into memory, change the bits you want, then write the whole sector back. Now the root directory and file allocation table only needs to be changed when the file is written, and in this case there is already a buffer which has just been used, so I foolishly decided to have the code use this.

In the mean time, this same buffer was part of the circular buffer accumulating packets. If the buffer ever filled up, new bytes would be rejected, and the head pointer would not move. If you start to write a packet to an already full buffer, nothing gets written, and when you finish, the packet length seems to be zero. Now the fun begins.

Upon finishing a CCSDS packet, the code counts how many bytes have been written, subtracts 7 because that's according to the protocol, then goes back and changes the buffer. In the current case, since no data had been written, it actually changed marked data.

Now we combine this with the fact that the interrupt task was filling the buffer. Here then is the final sequence:

  1. The buffer is full. No more bytes can be written.
  2. The main loop detects this, and starts to drain the buffer and write it to the SD card. It writes the packet, then reads the root directory sector, changes it, then writes it back.
  3. During the write, an interrupt fires.
  4. The interrupt task writes another packet. The bytes are dropped on the floor.
  5. The interrupt task finishes the packet. The calculated length is zero, since no new bytes were written
  6. The length to be written is -7, 0xFFF9
  7. The length is written to the buffer, but because of timing, it writes it over the top of the root directory sector in memory. The root directory is now corrupted.
  8. The interrupt task finishes
  9. The SD writing code writes the corrupted root directory sector back to the card.
  10. The card is now named RKT\FF\F9 instead of RKTO3.
It would be pretty easy for this same process to shoot holes anywhere in the filesystem metadata. This bug is hazardous!

So, to fix the bug, we do the following:

  1. Give the file system its own scratch buffer, and never let it do scratch work with the user buffer. This doesn't actually fix anything, but it does make the bug, and any similar undiscovered bug, less dangerous.
  2. Handle the full buffer case better. When the buffer is full, it throws away the new incoming bytes. Further, it backs the head pointer to the mid pointer, in effect throwing away the fractional packet accumulated before the buffer became full. Even further, since the buffer is now no longer full, to prevent getting the last half of a packet, we keep a flag that says that the buffer was full, and don't accept any more bytes until the flag is cleared by a drain operation.
  3. The interrupt task checks if the buffer is already full. If it is, we don't bother to read the sensors, a time-consuming chore which will just discard the result and take time away from the SD writing routines.
This clears a major bug, perhaps one causing all the observed halts. A bug which punches random holes in the metadata can do anything. It only triggered recently when I cranked the data recording rate to 200Hz, which appears to be just beyond the equipment capability. Previous rocket flights must have run at lower rate and not triggered the bug, or not triggered it as hard.

So how did I find it? Lots of print statements and blinklocks. First, I had the SD driver print out the buffer it was going to write. Then I had the direntry driver (the thing which actually handles the root directory) print out its scratch buffer after it had been read, as it changed each byte in turn, and before it wrote it out. From that I found out that the buffer was changing in between lines. That meant that an interrupt task was doing it. Further puzzling out and seeing 0xFFF9, then counting on my fingers to see what number that is in signed decimal, revealed that it was -7, a key number in calculating CCSDS packet sizes. However, it wasn't writing bytes 4 and 5, where it would be if the packet started at byte 0 in the buffer. Instead it wrote one byte before that, implying that the phantom packet started at -1, a key number in the circular buffer pointers. Once the head pointer is one less than the tail pointer, the buffer is full.

For a while I thought that the code was starting a new packet before it finished an old one, which would result in weird numbers. So, I put in some code which blinklocked the device if a packet had not been finished before the next one was started. That never hit. So, I put in some code which blinklocked when the circular buffer filled up, and that finally hit and proved it all. This interacted weirdly with the overlapping packet checker, so I had to fix a bug there. 

Now the only remaining known weakness is what we do if the first cluster of the root directory is full. We can make a cluster as big as 64kiB, we can handle multiple sectors in a cluster in the root directory, and we can fit 2ki entries into a cluster that big. But, what if we fill it? The file search code will need to know how to go to the next cluster, and the file creation code will need to know how to allocate a new cluster. Or, the format process can write 10000 files then delete them all.

The flight version of the code will not blinklock, but reset, when blinklock() is called. Between that and gathering enough runtime, I can gain some confidence that the device won't lock up.

I set the circular buffer size back to two sectors, down from the 6 I had given it. Next version will put the circular buffer into USB ram to save space for the stack, and allow me to have up to 16 sectors. USB memory doesn't work for some reason... maybe the USB system needs to be on for it? Then we are exchanging memory for power consumption. I know that the system is dropping packets now, more so than it was at 6 sectors.

No comments:

Post a Comment