Friday, May 13, 2016

Hooray for distributed backups!

BACK UP ALL THE THINGS!
I found a backup of my database from April 21 of this year, which was only a couple of days before the failure. That should have all of my wikis and gallery data, and therefore represents the second-most important data I have. The most important is the code, and that is backed up by means of git. I know for sure that there is a valid git repository on one of my portable USB disks.



Thursday, May 5, 2016

Yet Another Episode in the Annals of Data Stewardship

Having learned my lesson from before, I did not set up my filesystem as one big raid0; I went with btrfs raid5 instead. When one of the disks finally did give out, it wasn't with the click of death I had heard before, but with read errors. The btrfs degraded, and by mounting read-only in recovery mode, I was able to use the two good disks to get my data off.

Or so I thought.

A word on the issue I was having: I was seeing "stale file handle" warnings, the kind you get when you're sitting in an NFS-mounted folder after the connection drops. But this wasn't an NFS mount. I rebooted the system and it wouldn't come back up, because the btrfs refused to mount. After manually mounting it in degraded mode, many of the disk accesses reported errors in dmesg about the generation of certain metadata being off from the expected value, often by hundreds or thousands of generations.
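For reference, the recovery mount itself is nothing exotic - roughly the sketch below, with made-up device and mountpoint paths standing in for the real ones:

    #!/usr/bin/env python3
    # Sketch: mount a degraded btrfs read-only so the data can be copied off.
    # The device and mountpoint are placeholders, not my real paths.
    import subprocess

    DEVICE = "/dev/sdb"          # any surviving member disk of the btrfs
    MOUNTPOINT = "/mnt/recover"  # an empty directory to mount it on

    # ro        - never write to the ailing filesystem
    # degraded  - let btrfs come up with a member disk missing
    # recovery  - fall back to an older tree root if the current one is bad
    #             (later kernels renamed this option to 'usebackuproot')
    subprocess.run(
        ["mount", "-t", "btrfs", "-o", "ro,degraded,recovery", DEVICE, MOUNTPOINT],
        check=True,
    )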

First, I decided that I had lost confidence in btrfs -- if it wasn't going to keep working in the presence of a disk failure, what was the point? I spent the next several days scraping data off of the btrfs and putting it wherever I could find a place for it - on the USB disks I have, on other computers, on the system disk, etc. I then replaced the bad disk and formatted the data disks as zfs - now possible since Ubuntu 16.04 includes a native zfs driver.
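Building the replacement pool is basically a one-liner once the zfs tools are in place. Something like the sketch below - the pool name, the raidz layout, and the device names are all placeholders, not necessarily what I actually used:

    #!/usr/bin/env python3
    # Sketch: create the replacement zfs pool. The pool name, the raidz layout,
    # and the device names are placeholders - pick the redundancy you really want.
    import subprocess

    POOL = "tank"
    DISKS = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]  # the data disks

    # raidz is single-parity, roughly the zfs counterpart of the old raid5.
    subprocess.run(["zpool", "create", POOL, "raidz"] + DISKS, check=True)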

Finally, I started copying data back onto the zfs. All appeared to go well, until I tried to bring up the wiki. The LocalSettings.php file was effectively blank - it was the expected size, but every byte in the file was 0x01. Hrm.

Turns out a lot of files were like this: files I care about, like the database, the git repositories, and so on. The newer the file, the more likely it was to be damaged.
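At least the damage is easy to find with a quick script - something like this sketch, which walks a tree and prints every file whose bytes are all 0x01 (the starting path is just an argument):

    #!/usr/bin/env python3
    # Sketch: walk a tree and print every file made up entirely of 0x01 bytes.
    import os
    import sys

    CHUNK = 1 << 20  # read a megabyte at a time

    def is_all_ones(path):
        """True if the file is non-empty and every byte in it is 0x01."""
        if os.path.getsize(path) == 0:
            return False
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    return True
                if chunk.count(b"\x01") != len(chunk):
                    return False

    if __name__ == "__main__":
        root = sys.argv[1] if len(sys.argv) > 1 else "."
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    if is_all_ones(path):
                        print(path)
                except OSError:
                    pass  # unreadable; skip it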

No problem, I've got backups. A raid5 is not a backup, so I had the most important data copied off onto several other systems.

Or so I thought.

My backup script runs from cron every night, and it had backed up the bad data and spread it all over the good copies.

Oops.

It isn't a total loss. I have all my code in a git repository on the big USB disk. I have an old backup (from December, I think) of all the data I considered important. I did lose a lot of video :( but I don't think I lost anything from Florida 5.

So I think.