apocryph.org Notes to my future self

3 Dec 05

More trouble in paradise: software RAID controllers can suck

Yesterday while I was at work, there was a brief power fluctuation in my townhouse. Since I’m still setting up aenea, she isn’t yet in my server closet, or hooked up to a UPS. So, predictably, she lost power.

This is somewhat bad: the Highpoint RocketRaid 2220 SATA RAID controller that powers her 1TB RAID 5 disk array does not deal at all well with unclean shutdowns, because the RAID logic is implemented in a software driver, not in hardware.

Predictably, I suffered some file system damage. I now can’t boot, because /var seems sufficiently damaged to cause a panic in some ffs_whatever module. Thankfully it was /var and not, say, /usr, but nonetheless it sucks badly.

I’ve booted the FixIt shell on the FreeBSD 6.0 install disc, and loaded the hptmv6.ko kernel module from a USB floppy, so now I’m hoping I can fsck the problem away from this shell.

First, I’m discovering that a standard fsck in the FixIt shell doesn’t recognize the /var filesystem. fsck_ufs does the trick, but when I run it with fsck_ufs /dev/da0s1d it just outputs the file system errors and calls it a day; it doesn’t fix them.

Hmm, fsck doesn’t work because it’s looking in /sbin and /usr/sbin for the fsck_* executables, but in the FixIt environment they’re in /mnt2/usr/sbin. The FixIt shell is just flaky; sometimes I’ll run a command (ls, fsck, mount, man; it doesn’t matter what) and it hangs. Over on VTTY 2 (Alt-F2) I see about 15 timeout errors from acd0 before the shell finally comes back, only to hang again on my next command.

Fortunately, I’ve read on the lists that the first thing to try when a file system is fucked is to boot in single user mode (option 4 on the boot menu iirc). That boots fine and gets me to a shell prompt.

I run

fsck -p /dev/da0s1d

Where -p is preen mode, which from the man page I gather checks for minor inconsistencies, but won’t handle major problems. All the list posts I see use this first.

From this I get:

/dev/da0s1d: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY

I gather that’s bad. I found a few things on the list:

First, this frightful message advocating I use vi on the directory to remove invalid file entries. Um, no.

Next, this USENIX paper on FreeBSD soft updates, which explains what they’re for (to allow fsck to run whilst the file system is mounted, for speedier recovery), and when that doesn’t work (when the soft update snapshot is inconsistent, e.g. after a power failure or crash).

So, with little help from the ‘net, I went ahead with:

fsck /dev/da0s1d

And got the UNEXPECTED SOFT UPDATE INCONSISTENCY, this time with a prompt: REMOVE? [yn]. I’m going to go with ‘yes’ and hope for the best…another error, this one UNREF FILE. The prompt is RECONNECT? [yn]. I’ll go with ‘yes’ again. Another ‘yes’ to the NO lost+found DIRECTORY CREATE?

A ton more UNREF FILE msgs; ‘yes’ each time.

FREE BLK COUNT(S) WRONG IN SUPERBLK. SALVAGE? [yn] Most definitely.

SUMMARY INFORMATION BAD. SALVAGE? [yn] Sure, go ahead.

BLKS MISSING IN BIT MAPS. SALVAGE? [yn] Yeah, if you want…

And then, as if nothing had happened, FILE SYSTEM MARKED CLEAN. Yay.

Now I do:

fsck -p

To do a preening check on all the file systems. A few minor errors on /dev/da0s1f and /dev/da0s1e, but nothing fsck couldn’t handle on its own. Took a long time to scan the huge ~900GB partition…

Done now. I’ll exit this shell and proceed with the boot process, hoping for the best.

Voila! Booted fine.
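
The sequence above (preen first, then a full fsck only when preen gives up) can be sketched as a small shell function. This is a minimal sketch under assumptions, not the exact commands run here: repair_fs and the FSCK override variable are hypothetical names, any non-zero exit status is treated as failure, and -y stands in for typing ‘yes’ at every prompt, which can throw data away.

```sh
# Sketch of the recovery sequence: preen first, escalate only if preen fails.
# repair_fs and FSCK are hypothetical names for illustration; FSCK defaults
# to the real fsck and exists so the logic can be dry-run with a stand-in.
repair_fs() {
    dev="$1"
    fsck_cmd="${FSCK:-fsck}"

    # Preen mode fixes minor inconsistencies by itself and exits non-zero
    # when it hits something it will not touch.
    if $fsck_cmd -p "$dev"; then
        echo "$dev: clean after preen"
        return 0
    fi

    # Preen gave up. -y answers "yes" to every question, the same as typing
    # 'y' at each REMOVE? / RECONNECT? / SALVAGE? prompt. It can discard data.
    echo "$dev: preen failed, running full fsck"
    if $fsck_cmd -y "$dev"; then
        echo "$dev: repaired"
        return 0
    fi

    # Assumption: any non-zero exit status means the file system is still bad.
    echo "$dev: still damaged" >&2
    return 1
}
```

From the single user mode shell this would be called as repair_fs /dev/da0s1d, followed by a plain fsck -p to preen everything else. Running fsck by hand without -y is the safer choice if you want to see each question before agreeing to it.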

So the moral(s) of the story are:

  • When using a software RAID driver, you mustn’t let the power go out
  • When using a BSD UFS file system, you mustn’t let the power go out
  • When using UNIX in general, you mustn’t let the power go out

It’s hard for me to get used to this, as the bulk of my computer hours have been spent on Windows, where I’ve forced shutdowns countless times, and never had any serious file system damage. Needless to say, aenea is going on a UPS right now.

UPDATE: aenea sucks so much power she overloads the UPS I have on prospertine. I’ll have to move her into the server closet early, just so she’ll have an available UPS.
