Why one should have a separate /boot partition: lessons learned

Background

The box (a SPARC box, to make matters more interesting) has two disks, /dev/hda and /dev/hdb. They have the same partition table, apart from some free space at the end, since two disks are never exactly the same size. All partitions are of type "RAID autodetect" and are indeed mirrored with RAID1. Several RAID devices exist:

  • /dev/md0 -> /
  • /dev/md1 -> swap
  • /dev/md2 -> /home
  • /dev/md3 -> /var
  • /dev/md4 -> /home

As disks are still disks, we started having serious trouble with /dev/hda on the /var partition. It started with I/O errors on /dev/hda that broke the mirror. Investigation with SMART monitoring tools showed first 40, then 200-something unrecoverable errors. Trying to force SMART to repair these errors failed miserably. I'm not sure why; maybe the spare sectors were all used (although only 8 are reported as used!). Anyway, the disk is not in a healthy condition.
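A broken mirror like this shows up in /proc/mdstat as an underscore in the member status string. The following is only a minimal sketch of how one might spot that from a script: the function name degraded_arrays and the SAMPLE text (including its partition numbers and block counts) are invented for illustration and are not taken from the box described here.

```python
import re

# Invented sample in the usual /proc/mdstat layout; partition numbers
# and block counts are hypothetical.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 hdb1[1] hda1[0]
      1000000 blocks [2/2] [UU]
md3 : active raid1 hdb4[1] hda4[0](F)
      2000000 blocks [2/1] [_U]
"""

def degraded_arrays(mdstat_text):
    """Return {array: status} for arrays whose status string contains '_'."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line looks like "... blocks [2/1] [_U]".
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded[current] = status.group(3)
            current = None
    return degraded

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a real machine one would pass the contents of /proc/mdstat instead of SAMPLE.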

Being cautious, we decide to play it safe: / has enough space, so we move /var there. Next we try to build a new filesystem with a bad-block scan on /dev/md3. No way: Linux software RAID just doesn't like this and fails the /dev/hda partition. At that point we decide to leave /var on the root filesystem for now, until we get round to buying a new hard drive.

The next step is updating /etc/fstab, stopping the /dev/md3 RAID device, and zeroing the superblocks so mdadm doesn't try to assemble it at boot time. Then a simple reboot, so we can be 100% sure everything is still fine.
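For reference, the stop-and-zero step can be written down as a dry run. This is a sketch, not a record of the exact commands that were typed: the helper name retire_array_commands is made up, and /dev/hdaX and /dev/hdbX are placeholders because the member partition numbers are not given here. Only the mdadm options --stop and --zero-superblock are real. The script prints the commands and does not execute them.

```python
def retire_array_commands(array, members):
    """Build (but do not run) the mdadm commands to retire a RAID device."""
    # Stop the array first; superblocks can only be zeroed on inactive members.
    commands = [["mdadm", "--stop", array]]
    for member in members:
        # Wipe the RAID superblock so the member is not auto-assembled at boot.
        commands.append(["mdadm", "--zero-superblock", member])
    return commands

if __name__ == "__main__":
    # Placeholder member names: substitute the real partitions.
    for argv in retire_array_commands("/dev/md3", ["/dev/hdaX", "/dev/hdbX"]):
        print(" ".join(argv))
```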

The problem

After rebooting we only get SI from the SILO boot loader. WTF?? We boot from a Debian installation CD with boot: rescue root=/dev/hda0. No luck. Whatever we try, no rescue boot works. So we take out the disk, attach it to another box, and run silo -r on it. We put it back and everything is fine.

What happened

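When we copied /var (a large chunk) to the root partition, the filesystem driver or SMART must have decided that it was more efficient to move some of the existing files. So it very funnily moved our /boot/second.b file around. SILO's first stage finds second.b through the block locations recorded when silo was last run, so once the file had moved the loader could no longer load it. That wasted about 4 hours of my time.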

So the lesson learned is to always make a separate partition for /boot so problems like this don't occur. Now we only need to fiddle with the usable sectors of /dev/md3 to make a smaller partition with no errors on it and move /boot there. But that won't be for today!
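Once /boot has been moved, a quick way to confirm that it really lives on its own filesystem is to compare device numbers. This is a small sketch under the assumption that Python is available on the box; the function name on_separate_filesystem is made up for this example.

```python
import os

def on_separate_filesystem(path, parent="/"):
    """True if path is on a different device than parent."""
    # st_dev identifies the device holding the file; a separately
    # mounted /boot reports a different value than /.
    return os.stat(path).st_dev != os.stat(parent).st_dev

if __name__ == "__main__":
    if os.path.exists("/boot"):
        print("/boot separate:", on_separate_filesystem("/boot"))
```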