Wednesday, August 15, 2012

Resurrecting Omoikane

So, as you may or may not know, I ran a nice little Linux server with all my data on it, including a filesystem dating back to at least 2003 with files back to 1999. I used LVM to spread the file system across all the drives I had, so that I didn't worry about which file was on which drive. I let the filesystem driver handle that.

Well, a couple of months ago, one of the drives in Omoikane started emitting this terrible shaking noise, which panicked the kernel. When I restarted the system, that drive was dead.

So, now I get to learn more than I cared to about the LVM and ext4 filesystems, in order to recover what I can from the good drive. As it happens to turn out, the system was in five "stripes", continuous blocks of LVM extents. Three of them, including the first one, are on the good disk, representing 2TB of the total 3.5TB system.

First thing is to write a really primitive LVM driver. I used the LVM tools to get a map of the stripes, then hard coded that map into my recovery program. This means that my program is not directly applicable to your problem, if you stumbled across this looking for LVM-saving hints. But, I did learn something about LVM: It uses the first 384 sectors (512 bytes each, 192kiB total) of each physical volume to record info about the entire LVM system the volume is participating in. This means that if any drive is still good, I can use it to reconstruct the structure, and find out which stripes I have and which I don't.

After the LVM header, each physical volume is just a trackless waste of unstructured bytes. The stripe information in the LVM header is needed just to see the order of the LVM extents on the physical volume. This is actually good, as it means that I don't have to interpret anything in the sea of data, at least in an LVM sense. To find the Nth extent of a logical volume, use the stripe map to get which extent of which stripe on which drive, then seek to 384*512+M*65536*512 to get to that extent, where M is the physical extent number you get from the stripe map.

Next it's on to writing a really primitive ext4 driver. Ext4 is rather sophisticated, in that it keeps track of a lot of data and uses complicated algorithms to decide where and when to write what. The good news is that Ext4 is a straightforward extension of Ext3, which again is an extension of Ext2, which was quite a bit simpler. Because it makes an effort at backward compatibility, much of Ext4 is readable with the simpler Ext2 logic.

For instance: Ext4 is divided up into block groups, same as Ext2. Each block group has an inode table and some bitmaps to help the full-blown driver allocate things quickly and optimally. Some block groups, always including the first, include a superblock with information about the entire file system, and a table of block group descriptors. Each block group table contains descriptors for all the block groups. So, in the first block group, we find a superblock, descriptors of all the block groups, and an inode table which lets us start finding files.

Now here's the clever bit. For a variety of reasons, pointers in different structures may point to blocks in other groups. That's ok, because all block pointers are absolute, meaning that they are all relative to the beginning of the filesystem. In one of the ext4 sophistications, it groups block groups into metagroups, and combines the inode tables and bitmaps from several block groups into one contiguous stream. For instance on Omoikane, 16 contiguous block groups made up a metagroup, so that the inodes for all 16 groups are in the first group. My code doesn't care, because the inode pointer in the block group descriptor points to the right place inside the inode metatable.

Another clever bit is in the directory structure. As is typical with a Unix filesystem, all the important information about a file is in the inode, including the file's length, permissions, owner, and block map. Everything you need to read a file, in other words. The directory only contains the name and an inode index. This is how hard linking is implemented: If two directory entries point to the same inode, then the file has two hard links, and each entry has equal claim to being the true name of the file. The directory entries don't even have to be in the same directory.

Ext2 just searched each directory entry linearly. Ext4 has the option of indexing the directory file, but it is done in such a way that Ext2 logic will completely ignore the index. The index data is actually in the directory entry for '..' after the file name.

In ext2, the closest thing to a "file allocation table" analogous to FAT filesystems is the block map. This map starts in the inode, which has the index for the first 12 blocks used by the file. Files of up to 48kiB are accomodated thusly. If the file takes more than 12 blocks, the 13th entry in the index points to another data block, the indirect block, which is completely full of pointers to the actual data blocks, allowing 4kiB*1ki blocks=4MiB more data, with the cost of 4kiB extra index data. For larger still files, we have the 14th entry which points to the double indirect block. Each pointer in this block points to another block full of pointers to the actual data blocks. 4kiB*1ki blocks*1ki blocks=4GiB more data, at the cost of 4MiB+4kiB more index data. Similarly the 15th entry points to the triple indirect block, which allows 4TiB more data at the cost of 4GiB+4MiB+4kiB of index data. Each step means about 0.1% overhead in storing a large file. Larger block sizes make the indirect blocks able to hold more pointers, so the level factor is more than 1024.

However, in ext4, a new block index called an extent tree (not the same as LVM extents) is used. The block map tree is fairly complicated, and needs to mark every block a file uses. Extents basically run-length compresses this by using one extent entry for each contiguous stream of blocks the file uses.

No comments:

Post a Comment