Coffee Space


Software RAID Part 2

See the previous article first.

TL;DR

There was not supposed to be a part 2. It all went to hell after a reboot. I did some work to bring it back from the dead and things now seem to be okay.

Force Reboot

I mostly just suspend my laptop for daily work: I have tonnes of windows open over three displays in Ubuntu and it’s ridiculous to keep setting these up each time. Suspending is generally flaky in Linux - it seems to be because ACPI is a bit of a mess and things don’t always come back up after being put into a low-power state. I somehow got this working reliably on my machine after much messing around… But on rare occasions, the machine wakes itself up from suspend. The result is that if I’m out, the laptop drains its battery (completely) and dies hard.

This was such an occasion. Much annoyed. After a moment of questioning my sanity, I figure it needs to be cold-booted. I get into Ubuntu and start my file systems (they don’t auto-mount, don’t @ me). Second drive… Okay. RAID drives… Nothing.

Where You Gone?

$ ls /dev/md*
$

Wot? There is no software RAID device. Dear lord.

I start Googling and running the various commands: sudo mount /dev/md10 /mnt/md10 - nothing. cat /proc/mdstat - nothing. sudo mdadm --examine /dev/sd[cdef] - no RAID devices found… This went on for a while.
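Those read-only checks can be bundled into one helper. This is only a rough sketch: raid_diagnose is a name made up for illustration, and the device names in the usage line are the ones from this article. Nothing in it writes to the disks.

```shell
# Hypothetical helper: run the read-only checks in one go.
raid_diagnose() {
    # Are there any md block devices at all?
    ls /dev/md* 2>/dev/null || echo "no /dev/md* devices"
    # What does the kernel md driver currently know about?
    cat /proc/mdstat 2>/dev/null || echo "no /proc/mdstat (md module not loaded?)"
    # Look for an md superblock on each disk passed as an argument
    for disk in "$@"; do
        sudo mdadm --examine "$disk"
    done
}

# Usage (device names assumed from this article):
# raid_diagnose /dev/sdc /dev/sdd /dev/sde /dev/sdf
```

If every disk comes back without a superblock, that points at missing RAID metadata rather than dead hardware.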

I thought “okay, the hard drives are still there, what is it that the system sees?”. I open the Ubuntu Disks utility: all four drives are in there, with no partitions. They display as free space.

I read in several places online that there doesn’t seem to be any recovery if you can’t detect the array device (i.e. /dev/md10). Ahhhhh! There wasn’t really important data on there, but it was a massive pain to download on my potato network.

Wild Card

There was a single place that suggested a possible fix, but it was going to be pretty much chance as to whether it worked. Some of the key highlights are here:

It’s dangerous to use full disk for anything other than a partition table. As soon as anything else writes a partition table, your full disk RAID / LUKS / LVM / filesystem metadata is gone. And user error aside, there are a lot of tools and circumstances out there that may write a partition table without really asking you.

This is a new one on me! Full disk RAID was the default freaking option, why is the default the least safe?!

Thus, your only hope here is to re-create the RAID to build new metadata from scratch.

And this is the unfortunate news… If there is to be any hope of recovery:

And it must be re-created exactly the same way it was, so if you’re sure you used full disk and not partitions, you must re-create it with full disks, and in the right order too. Think about migrating to partition instead of full disk devices when you have your data back.

We really only get one shot at this. One thing that works in my favour is that if I restart my drives, they are powered on one-by-one - I assume to reduce the inrush current from getting the disks spinning. This means that there was a good chance that this was also the order in which I attached them, so I just needed to get the drive order correct to have a chance.

sudo mdadm /dev/md10 --create --assume-clean --level=10 -n 4 \
  /dev/sdc /dev/sdd /dev/sde /dev/sdf

I waited for a moment with bated breath… Success! I then mounted the drive with sudo mount /dev/md10 /mnt/md10. I backed up everything whilst I still could.
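With hindsight, a slightly more cautious version of this step is possible. The sketch below is not what I actually ran, and both function names are made up for illustration. It notes down which physical drive sits behind each device name (the /dev/sdX letters are not guaranteed to stay the same across boots), then mounts the re-created array read-only for a first look.

```shell
# Hypothetical helper: print model and serial for each whole disk so the
# order can be written down before anything is re-created.
record_drive_order() {
    lsblk --nodeps --output NAME,SIZE,MODEL,SERIAL "$@"
}

# Hypothetical helper: mount read-only to reduce the chance of further
# writes if the drive order was guessed wrong, then list the contents.
check_array_readonly() {
    sudo mount -o ro "$1" "$2" && ls "$2"
}

# Usage (names assumed from this article):
# record_drive_order /dev/sdc /dev/sdd /dev/sde /dev/sdf
# check_array_readonly /dev/md10 /mnt/md10
```

If the listing looks sane, unmount and remount normally before backing up.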

Perma-Fix

As suggested, it seems to be the GPT partition table causing the trouble. It seems some part of my Linux setup takes it upon itself to overwrite the data on the disk with what it thinks to be correct. How the hell is that default behaviour?!

Anyway, to do the fix you need to do the following:

sudo umount /mnt/md10                       # Unmount the RAID
sudo mdadm --stop --scan                    # Stop the RAID array
sudo wipefs --all --types gpt,PMBR /dev/sdc # Wipe GPT
sudo wipefs --all --types gpt,PMBR /dev/sdd # Wipe GPT
sudo wipefs --all --types gpt,PMBR /dev/sde # Wipe GPT
sudo wipefs --all --types gpt,PMBR /dev/sdf # Wipe GPT
# If /dev/md10 has not reappeared by itself, reassemble it before mounting, e.g.
# sudo mdadm --assemble /dev/md10 /dev/sdc /dev/sdd /dev/sde /dev/sdf
sudo mount /dev/md10 /mnt/md10              # Remount for use
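To double-check that the wipe removed only the partition table signatures, something like the following sketch could be used. verify_members is a hypothetical name. Run without options, wipefs only lists the signatures it can see and erases nothing, so ideally each disk should show just a linux_raid_member entry afterwards.

```shell
# Hypothetical helper: list the remaining signatures on each member disk,
# then show the state of the array as the kernel sees it.
verify_members() {
    for disk in "$@"; do
        echo "== $disk =="
        sudo wipefs "$disk"   # no options: list only, nothing is erased
    done
    cat /proc/mdstat
}

# Usage (device names assumed from this article):
# verify_members /dev/sdc /dev/sdd /dev/sde /dev/sdf
```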

Lessons

The default instructions for software RAID seem to almost all be wrong. The default full disk method is simply dangerous. I imagine tonnes of people have lost their data as a result. It wasn’t exactly easy to find any solution, and in some respects I was lucky that my drives power up in a reliable order - otherwise I might have needed to try 4! (4x3x2x1 = 24) different combinations.
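For a sense of what those 24 combinations look like, here is a small sketch that only prints the candidate orderings and runs nothing against the disks. permute is a made-up helper, and it assumes the device names are unique and contain no spaces.

```shell
# Print every ordering of the space-separated items in $2, one per line.
# $1 is the ordering built so far (start with an empty string).
permute() {
    local prefix="$1" items="$2" item other rest
    if [ -z "$items" ]; then
        echo "$prefix"
        return
    fi
    for item in $items; do
        # Everything except the chosen item is left for the recursive call
        rest=""
        for other in $items; do
            [ "$other" = "$item" ] || rest="$rest $other"
        done
        permute "${prefix:+$prefix }$item" "${rest# }"
    done
}

# Print (not run!) one candidate command per drive ordering
permute "" "/dev/sdc /dev/sdd /dev/sde /dev/sdf" | while read -r order; do
    echo "sudo mdadm /dev/md10 --create --assume-clean --level=10 -n 4 $order"
done
```

Every wrong guess is another write of RAID metadata to the disks, so having a single plausible ordering to start from was a real stroke of luck.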

I’ll update if anything goes wrong. It might be a while as I generally try to avoid rebooting; I have work to do, after all.