Wednesday, 29 June 2011

How to replace a failed disk of a RAID 5 array with mdadm on Linux

This is easy, once you know how it's done :-) These instructions were written on Ubuntu, but they apply to most Linux distributions.

First of all, physically install your new disk and partition it so that it has the same (or a similar) structure as the old one you are replacing.
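If the new disk is the same size as (or larger than) the surviving ones, a quick way to do this on MBR-partitioned disks is to copy the partition table from a healthy array member. A rough sketch - assuming /dev/sdc is a healthy member and /dev/sda is the new, empty disk (double-check the device names, this overwrites the target's partition table!):
sudo sfdisk -d /dev/sdc | sudo sfdisk /dev/sda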

Then, install mdadm if you haven't already:
sudo apt-get install mdadm

Optional, but a good idea if you can: reboot to make sure the whole md (multi-disk) software stack is loaded.
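If a reboot is not convenient, loading the RAID 4/5/6 kernel module by hand is usually enough - assuming a stock kernel that builds the md personalities as modules:
sudo modprobe raid456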

Let's check the status of our disks - mdadm should have discovered at least *something* on boot:

evy@evy-server:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdb2[1](S) sdc2[2](S)
      3858200832 blocks

So it looks like mdadm found one RAID device (/dev/md0) consisting of /dev/sdb2 and /dev/sdc2. We need at least 3 disks for our RAID 5, so let's add the replacement disk:

evy@evy-server:~$ sudo mdadm --manage /dev/md0 --add /dev/sda3
mdadm: cannot get array info for /dev/md0

Hmm, this doesn't work because the RAID array in /dev/md0 is not active.
Let's activate it then:

evy@evy-server:~$ sudo mdadm --assemble /dev/md0
mdadm: no devices found for /dev/md0

Hmm, mdadm is not aware of the physical disks that /dev/md0 consists of. That's strange, because that information is clearly visible in /proc/mdstat, remember?
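By the way, you can always ask mdadm what it knows about an individual member by examining the RAID superblock on that partition - useful for double-checking UUIDs and device roles (substitute your own partition name):
sudo mdadm --examine /dev/sdb2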

A look at the man page reveals the --scan option, which tells mdadm to scan the config file and /proc/mdstat for missing RAID device information.

Let's try again:

evy@evy-server:~$ sudo mdadm --assemble /dev/md0 --scan
mdadm: /dev/md0 assembled from 2 drives - not enough to start the array while not clean - consider --force.

OK, so we knew that - the RAID 5 is missing a disk.
But we need to activate it anyway, to be able to add the replacement disk!

So we force it:

evy@evy-server:~$ sudo mdadm --assemble /dev/md0 --scan --force
mdadm: SET_ARRAY_INFO failed for /dev/md0: Device or resource busy

Hmm, something's keeping our RAID array busy - most likely the inactive md0 that was half-assembled at boot, still holding on to the member devices.
Let's stop that RAID volume and assemble it again:

evy@evy-server:~$ sudo mdadm --stop /dev/md0
mdadm: stopped /dev/md0

And try again:

evy@evy-server:~$ sudo mdadm --assemble /dev/md0 --scan --force
mdadm: /dev/md0 has been started with 2 drives (out of 3).

Success!
Let's see what happened just now: 

evy@evy-server:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb2[1] sdc2[2]
      3858200832 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]

Aha, there is movement!
The RAID 5 array is online, and the [3/2] [_UU] flags show us that only 2 of the 3 disks are present - the '_' marks the missing one.

Here's some more info that demonstrates this nicely:


evy@evy-server:~$ sudo mdadm --detail  /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Mon Feb  7 18:27:14 2011
     Raid Level : raid5
     Array Size : 3858200832 (3679.47 GiB 3950.80 GB)
  Used Dev Size : 1929100416 (1839.73 GiB 1975.40 GB)
   Raid Devices : 3
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Apr 25 13:04:09 2011
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 90b591c9:2fd9536f:86e10ba8:fe4e4a37
         Events : 0.172

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       18        1      active sync   /dev/sdb2
       2       8       34        2      active sync   /dev/sdc2
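
A side note: in my case the failed disk was already physically gone, which is why slot 0 simply shows 'removed'. If your dying disk is still listed as an array member, you would first mark it as failed and remove it from the array before adding the replacement - roughly like this, with /dev/sdX1 standing in for the old member's partition:
sudo mdadm --manage /dev/md0 --fail /dev/sdX1
sudo mdadm --manage /dev/md0 --remove /dev/sdX1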

So let's try to add that missing disk, which had been our intent all along.


evy@evy-server:~$ sudo mdadm --manage /dev/md0 --add /dev/sda3
mdadm: added /dev/sda3

And this works.
Check it:

evy@evy-server:~$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Mon Feb  7 18:27:14 2011
     Raid Level : raid5
     Array Size : 3858200832 (3679.47 GiB 3950.80 GB)
  Used Dev Size : 1929100416 (1839.73 GiB 1975.40 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Jun 29 23:39:54 2011
          State : clean, degraded, recovering
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 0% complete

           UUID : 90b591c9:2fd9536f:86e10ba8:fe4e4a37
         Events : 0.176

    Number   Major   Minor   RaidDevice State
       3       8        3        0      spare rebuilding   /dev/sda3
       1       8       18        1      active sync   /dev/sdb2
       2       8       34        2      active sync   /dev/sdc2

The RAID has already started rebuilding!
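If you are impatient and the machine has nothing better to do, you can raise the kernel's rebuild speed limits a bit. These md sysctls exist on any recent kernel; the values below are only an example, tune them to your hardware:
sudo sysctl -w dev.raid.speed_limit_min=50000
sudo sysctl -w dev.raid.speed_limit_max=200000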

We can watch the progress like this:

evy@evy-server:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sda3[3] sdb2[1] sdc2[2]
      3858200832 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
      [>....................] recovery = 0.0% (673180/1929100416) finish=429.6min speed=74797K/sec
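Or, for a view that refreshes itself instead of running cat over and over:
watch cat /proc/mdstat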

Aaaaaaah, there's no greater time to go to bed than when your RAID 5 is recovering :-)
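One more thing for when the rebuild is done: to avoid the 'inactive array at boot' dance from earlier, make sure the array is listed in /etc/mdadm/mdadm.conf. On Ubuntu, something along these lines should do it (check the file afterwards so you don't end up with duplicate ARRAY entries):
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
sudo update-initramfs -u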
