Friday, May 13, 2016

Hooray for distributed backups!

I found a backup of my database from April 21 of this year, which was only a couple of days before the failure. That should have all of my wikis and gallery data, and therefore represents the second-most important data I have. The most important is the code, and that is backed up by means of git. I know for sure that there is a valid git repository on one of my portable USB disks.

Thursday, May 5, 2016

Yet Another Episode in the Annals of Data Stewardship

Having learned my lesson from before, I did not set up my filesystem as one big raid0. I did a btrfs raid5 instead. When one of the disks finally did give out, it wasn't with the click of death I heard before, but with read errors. The btrfs degraded, and by mounting read-only in recovery mode, I was able to use the two good disks in order to get my data.

Or so I thought.

A word on the issue I was having. I was seeing "stale file handle" warnings, of the type you see when you are in a folder that is NFS mounted, after you lose connection. But, this wasn't an NFS system. I rebooted the system and it wouldn't come up, because the btrfs refused to mount. After manually mounting in degraded mode, many of the disk accesses reported errors in dmesg, about the generation of certain metadata being off of the expected value, often by hundreds or thousands of generations.

First, I decided that I had lost confidence in btrfs -- if it wasn't going to keep working in the presence of a disk failure, what was the point? I spent the next several days scraping data off of the btrfs and putting it wherever I could find a place for it - on the USB disks I have, on other computers, on the system disk, etc. I then replaced the bad disk and formatted them all as zfs - now possible since Ubuntu 16.04 includes a native zfs driver.

Finally, I started copying data back onto the zfs. All appeared to go well, until I tried to bring up the wiki. The LocalSettings.php file was effectively blank - it had the expected size, but all bytes in the file were 0x01. Hrm.

Turns out a lot of files were like this. Files I care about, like the database, the git repositories, etc. It seems like the newer the file is, the more likely it is to be damaged like this.

No problem, I've got backups. A raid5 is not a backup, so I had the most important data copied off onto several other systems.

Or so I thought.

My backup script runs on a cron every night, and had backed up the bad data and spread it all around over the good data.
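A check along these lines might have caught the damage before the nightly run spread it around. This is only a sketch, not what my backup script does: the function names are made up for illustration, and it assumes the damage always looks like what I saw, a file of the right size filled with a single byte value.

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// True if the buffer is non-empty and every byte has the same value,
// which is how the damaged files looked (all 0x01).
bool is_single_byte_fill(const std::vector<std::uint8_t>& data) {
    if (data.empty()) return false;
    return std::all_of(data.begin(), data.end(),
                       [&](std::uint8_t b) { return b == data.front(); });
}

// Hypothetical helper: read a whole file into memory and apply the check.
// Returns false if the file cannot be opened.
bool file_looks_wiped(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::vector<std::uint8_t> data((std::istreambuf_iterator<char>(in)),
                                   std::istreambuf_iterator<char>());
    return is_single_byte_fill(data);
}
```

A backup script could refuse to overwrite an existing good copy whenever file_looks_wiped() is true for the source. A real version would want a minimum size threshold, since a legitimate one-byte file would trip this check too.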


It isn't a total loss. I have all my code in a git repository on the big USB disk. I have an old backup (from December, I think) of all the data I considered important. I did lose a lot of video :( but I don't think I lost anything from Florida 5.

So I think.

Monday, April 4, 2016

Hearing but not Understanding

I just heard a conversation drift over the walls of my cube. I could identify the speakers, I could recognize their voices, but I couldn't understand what they were saying. It was as if I couldn't parse spoken English. What was actually happening was that, between the noise of the fan in my cube, the sound insulation in the cube partitions, and the low level of the conversation to begin with, I just couldn't make it out.

But then how was it that I was able to identify the voices and put names to them, when I couldn't parse them? It means that at some level, identifying voices is easier and more noise-resistant than picking words out.

Or it means that the spoken English section of my brain is broken. I have neither spoken nor heard anyone speak since then, a few minutes ago.

Thursday, March 31, 2016

Check PCLK measurement

Check if the user code measures PCLK properly. If it doesn't, then the baud rate calculation will be wrong. Since one of the symptoms that has been seen is that the RX light on the FT232 doodad flickers, but no characters appear in putty, it is possible that the baud rate isn't what we think it is.

Consider also calculating the baud rate registers and stuffing them manually. If this works, then it's the PCLK stuff that is broken.
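To make the dependence on PCLK concrete, here is a rough sketch of the coarse divisor arithmetic, assuming the usual 16x-oversampling UART where baud = PCLK / (16 * divisor) and the fractional divider is left at its neutral setting. The function names and the clock figures in the usage note are made up for illustration, not taken from the user code.

```cpp
#include <cstdint>

// Coarse divisor for a 16x-oversampling UART:
// divisor = PCLK / (16 * baud), rounded to the nearest integer.
std::uint32_t coarse_divisor(std::uint32_t pclk_hz, std::uint32_t baud) {
    return (pclk_hz + 8u * baud) / (16u * baud);
}

// Baud rate the hardware actually produces, given the real PCLK
// and whatever divisor was programmed.
std::uint32_t actual_baud(std::uint32_t real_pclk_hz, std::uint32_t divisor) {
    return real_pclk_hz / (16u * divisor);
}
```

If the code believes PCLK is 18432000 Hz and asks for 115200 baud, coarse_divisor() gives 10. If the real PCLK is only 9216000 Hz, actual_baud(9216000, 10) comes out at 57600, which would produce exactly the kind of flickering-but-unreadable symptom described above. Stuffing the registers manually amounts to computing the divisor offline and writing its low and high bytes directly.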

Friday, March 11, 2016

The Secret to Success

  1. Pick something you like doing.
  2. Do it and do it and do it until you don't like doing it any more. This will always happen at some point.
  3. Keep doing it.

Following these steps doesn't guarantee success, but failing to follow them guarantees failure.

Tuesday, February 16, 2016

Cortex M4 FPU

For a while I was having trouble getting my part to print any FPU calculations. Finally it occurred to me that maybe the FPU has to be turned on, and that the ISP wasn't doing it since it didn't use it.

It turns out that you DO need to turn on the FPU:

4.6.6 Enabling the FPU
The FPU is disabled from reset. You must enable it before you can use any floating-point instructions. Example 4-1 shows an example code sequence for enabling the FPU in both privileged and user modes. The processor must be in privileged mode to read from and write to the CPACR.
Example 4-1 Enabling the FPU
; CPACR is located at address 0xE000ED88
LDR.W R0, =0xE000ED88
; Read CPACR
LDR R1, [R0]
; Set bits 20-23 to enable CP10 and CP11 coprocessors
ORR R1, R1, #(0xF << 20)
; Write back the modified value to the CPACR
STR R1, [R0]; wait for store to complete
DSB
;reset pipeline now the FPU is enabled
ISB

In effect, the FPU is counted as coprocessors 10 and 11. Cortex-M doesn't fully support the concept of coprocessors, but it does in this context. We allow full unprivileged access to coprocessors 10 and 11.

I don't know about waiting for the store to complete and resetting the pipeline. I just put some C++ code to do this long before the FPU is used, and let a bunch of other work instructions flush the pipeline.
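Roughly, the C++ side looks like the sketch below. The names are made up for this post, and the register is passed by reference only so the same read-modify-write can be exercised against an ordinary variable on a desktop machine. It does nothing about waiting for the store or resetting the pipeline; it relies on plenty of other instructions running before the first floating-point one.

```cpp
#include <cstdint>

// CPACR address, from the manual excerpt quoted above.
constexpr std::uintptr_t CPACR_ADDRESS = 0xE000ED88u;

// Bits 20-23 grant full access to coprocessors CP10 and CP11 (the FPU).
constexpr std::uint32_t CP10_CP11_FULL_ACCESS = 0xFu << 20;

// Read-modify-write of a CPACR-like register: set bits 20-23 and leave
// every other bit alone.
inline void enable_fpu(volatile std::uint32_t& cpacr) {
    cpacr = cpacr | CP10_CP11_FULL_ACCESS;
}

// On the target only (hypothetical wrapper name). Must run in privileged
// mode, and well before any floating-point instruction.
inline void enable_fpu_on_target() {
    enable_fpu(*reinterpret_cast<volatile std::uint32_t*>(CPACR_ADDRESS));
}
```

Calling enable_fpu_on_target() early in startup is the whole job; the split into two functions is just a convenience for testing off the part.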

Friday, February 12, 2016

Getting on board the LPC4078

A few things make the LPC4078 dramatically different from the LPC2148 that I am used to:

  1. This is a Cortex-M4, with a much different interrupt and reset vector table. Instead of a set of ldr pc,[pc,#24] instructions followed by addresses 24 bytes later, we have just a table of addresses. The first one is the value to put into the stack pointer upon reset (so there is no required stack setup code) and the second one is the value to put in PC on reset. Subsequent values include exception and interrupt handler addresses.
  2. As with the 2148, there is a bootstrap program. On reset, the bootstrap vector table is mapped to 0x00000000, rather than the more obvious solution of having the reset value of the vector table address register point at the bootstrap.
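The table-of-addresses idea from item 1 can be sketched like this. It is a host-side illustration with made-up names, not my startup code: it just shows that the first two words of the table are data that the core loads at reset, rather than instructions that it executes.

```cpp
#include <cstdint>

// What the core picks up from the vector table at reset.
struct ResetState {
    std::uint32_t initial_sp;    // word 0: value loaded into the stack pointer
    std::uint32_t reset_vector;  // word 1: address execution starts from
};

// Mimics the reset fetch, given the words of a vector table image.
// Words 2 and up are exception and interrupt handler addresses.
ResetState reset_state_from(const std::uint32_t* table) {
    return ResetState{table[0], table[1]};
}
```

Feeding it the first few words of a linked image is a quick way to check that the stack pointer and reset address are what you expect. On Cortex-M the handler addresses in the table have bit 0 set, marking Thumb code.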

One of the ugly things about my old LPC2148 is that much of it was Not Invented Here. Since this is a hobby project, I can use as much of my time as I want to do whatever I feel like, including reinventing wheels! So, I get to figure out how to start up a Cortex-M4 from scratch.

I'm not there yet.

I haven't gotten my code to run directly yet, so there is still a problem with my vector table. But, I can get my code to run with the help of the ISP, after some experimentation.

The ISP maps a total of 512 (0x200) bytes of memory as its vector table. This is room for 128 vectors, while the actual number of vectors it uses is only 7. I can tell because the reset vector points at what would be in slot 7. In any case, this code is mapped to address 0 at startup. When I was using the ISP to launch my code, I originally had only exactly enough memory reserved to hold my table, and my code started immediately after, at address 0xE4 as it happened. Well, that code was covered up by the bootstrap. After rearranging my program so that a full 512 bytes was allocated for the table, things worked much better.
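The arithmetic behind that fix, as a sketch. The struct and constant names are made up, and the count of 57 vectors is simply 0xE4 / 4 from the address above. In a real build the reservation would more likely be done in the linker script or with a section attribute, so treat this as a picture of the layout rather than the actual mechanism.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kUsedVectors = 57;        // 0xE4 bytes / 4 bytes per vector
constexpr std::size_t kIspWindowBytes = 0x200;  // region the ISP maps over address 0

// Vector table padded out so that nothing else lands under the ISP's window.
struct PaddedVectorTable {
    std::uint32_t vectors[kUsedVectors];
    std::uint8_t padding[kIspWindowBytes - kUsedVectors * sizeof(std::uint32_t)];
};

static_assert(kUsedVectors * sizeof(std::uint32_t) == 0xE4,
              "unpadded table ends where the code used to start");
static_assert(kIspWindowBytes / sizeof(std::uint32_t) == 128,
              "the ISP window has room for 128 vectors");
static_assert(sizeof(PaddedVectorTable) == 0x200,
              "padded table fills the whole window");

// Offset of the first byte of code placed right after the padded table.
constexpr std::size_t first_code_offset() {
    return sizeof(PaddedVectorTable);
}
```

With the padding in place, the first instruction of the program sits at 0x200, just past the region the bootstrap covers.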

Also, the ISP uses an autobaud feature when it sets up the serial port. You send a question mark, and it times the bits on that to set up the baud rate registers. However, the ISP uses a feature I don't, and that is the fractional baud rate register. This is able to tune the baud rate at a relatively fine grain, by stretching the divisor called for by the coarse baud rate registers to between 1x and 2x. When the ISP was used to kick off my code, my code set the coarse baud rate, but didn't touch the fractional baud rate register, which was left at about 1.5. Therefore the part was programmed to talk at a rate off by a factor of about 1.5, well out of tolerance for the serial port. The FT232 could tell that the part was talking, but couldn't understand any of it.
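As I understand the LPC UART, the fractional register holds two 4-bit fields, DIVADDVAL and MULVAL, and the divisor gets multiplied by (1 + DIVADDVAL/MULVAL). Here is a sketch of that arithmetic and of the register value that turns the feature off. The function names and the clock figures in the usage note are made up for illustration.

```cpp
#include <cstdint>

// Baud rate with the fractional divider in play:
// baud = PCLK / (16 * DL * (1 + DIVADDVAL / MULVAL)).
double baud_with_fraction(double pclk_hz, unsigned dl,
                          unsigned divaddval, unsigned mulval) {
    return pclk_hz /
           (16.0 * dl * (1.0 + static_cast<double>(divaddval) / mulval));
}

// Fractional divider register encoding: DIVADDVAL in bits 3:0,
// MULVAL in bits 7:4. DIVADDVAL = 0, MULVAL = 1 disables the scaling.
constexpr std::uint8_t fdr_value(unsigned divaddval, unsigned mulval) {
    return static_cast<std::uint8_t>(((mulval & 0xFu) << 4) | (divaddval & 0xFu));
}
```

With a PCLK of 18432000 Hz and a coarse divisor of 10, a neutral fractional setting gives 115200 baud, while a leftover DIVADDVAL/MULVAL of 1/2 gives 76800. The fix is for my code to write fdr_value(0, 1) to the fractional register whenever it sets the coarse divisor, instead of trusting whatever the ISP left behind.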

With those two things taken care of, I can now start my code with the ISP, from the very beginning of my code. Next step is to see what I am doing wrong such that my code doesn't start itself.