At our company, we have a backup solution that stores everything on a set of disks configured as a RAID5 array and served by a Coraid SR1521 appliance.
Everything had always worked fine... until today. One of the disks in the RAID5 array failed, so the system replaced it with a spare and began recovering. But, apparently, another disk failed while the recovery was taking place:
SR shelf 1> list -l
1 6001.229GB online
1.0 6001.229GB raid5 failed
1.0.0 normal 1000.205GB 1.0
1.0.1 failed 1000.205GB 1.2
1.0.2 normal 1000.205GB 1.4
1.0.3 normal 1000.205GB 1.6
1.0.4 replaced 1000.205GB 1.14
1.0.5 normal 1000.205GB 1.10
1.0.6 normal 1000.205GB 1.12
So we had a RAID5 array missing two of its disks, and the array stopped working. Let's try to recover it... first, I'll take the LUN offline so it cannot be accessed from the outside:
SR shelf 1> offline 1
SR shelf 1> list -l
1 6001.229GB offline
1.0 6001.229GB raid5 failed
1.0.0 normal 1000.205GB 1.0
1.0.1 failed 1000.205GB 1.2
1.0.2 normal 1000.205GB 1.4
1.0.3 normal 1000.205GB 1.6
1.0.4 replaced 1000.205GB 1.14
1.0.5 normal 1000.205GB 1.10
1.0.6 normal 1000.205GB 1.12
A RAID5 array cannot survive losing two of its disks (there's a quick sketch of why below), so I'll tell the system to ignore the failure of the second drive and see what happens:
SR shelf 1> unfail 1.0.1
error: raid 1.0 is failed
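That was worth trying, but the refusal makes sense. As an aside, here is why two missing disks are fatal to RAID5: each stripe stores a single parity block, the XOR of its data blocks, so any one lost block can be recomputed from the others, but one XOR equation cannot recover two unknowns. A toy sketch in Python (the 4-disk stripe is made up for illustration, not the Coraid layout):

from functools import reduce

def xor_blocks(blocks):
    # XOR equal-sized blocks together, byte by byte.
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

# One stripe across four hypothetical disks: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# One disk lost: XOR the parity with the survivors and the block comes back.
assert xor_blocks([data[0], data[2], parity]) == data[1]

# Two disks lost: the single parity equation now has two unknowns.
# xor_blocks([data[0], parity]) equals data[1] XOR data[2] -- an
# inseparable mix of the two lost blocks. The stripe is unrecoverable.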
So the unfail is refused while the disk belongs to a failed RAID. Well, I'll re-generate the array, telling the system that the disk is OK: restore -l reads the configuration stored on the drives and prints the corresponding make command, with 1.2 flagged as failed (:f) and 1.14 as a replacement (:r). After ejecting the failed array, I can re-issue that command without the :f flag:
SR shelf 1> restore -l
Reading config information from drives ... done.
make -r 1 raid5 1.0 1.2:f 1.4 1.6 1.14:r 1.10 1.12
online 1
SR shelf 1> eject 1
Ejecting lblade(s): 1
SR shelf 1> list -l
SR shelf 1> make -r 1 raid5 1.0 1.2 1.4 1.6 1.14:r 1.10 1.12
beginning recovery of disk 1.0.4
SR shelf 1> list -l
1 6001.229GB offline
1.0 6001.229GB raid5 recovering,degraded 0.03%
1.0.0 normal 1000.205GB 1.0
1.0.1 normal 1000.205GB 1.2
1.0.2 normal 1000.205GB 1.4
1.0.3 normal 1000.205GB 1.6
1.0.4 replaced 1000.205GB 1.14
1.0.5 normal 1000.205GB 1.10
1.0.6 normal 1000.205GB 1.12
Well, apparently it's working! The RAID is recovering, and the replaced disk is being rebuilt. Until...:
aborted recovery of disk 1.0.4
raidshield: unrecoverable error on disk 1.0.1 at block 6496009
raid device 1.0.1 has failed
unrecoverable failure on raid 1.0
SR shelf 1> list -l
1 6001.229GB offline
1.0 6001.229GB raid5 failed
1.0.0 normal 1000.205GB 1.0
1.0.1 failed 1000.205GB 1.2
1.0.2 normal 1000.205GB 1.4
1.0.3 normal 1000.205GB 1.6
1.0.4 replaced 1000.205GB 1.14
1.0.5 normal 1000.205GB 1.10
1.0.6 normal 1000.205GB 1.12
The same disk failed again, at block 6496009, while the server was rebuilding and recalculating parity. The good news is that it didn't fail until the recovery was more than 50% complete, and all the information I care about is in the first 20% of the disks. So I should be able to recover it... somehow :)
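As an aside, this is the classic RAID5 rebuild trap: the rebuild has to read every sector of every surviving disk, and at the unrecoverable-read-error rate commonly quoted on consumer SATA spec sheets (about one error per 10^14 bits read), scanning the six surviving 1 TB drives end to end gives a sizeable chance of hitting exactly this kind of error. A rough back-of-envelope check in Python:

import math

# Odds of at least one unrecoverable read error (URE) during the rebuild,
# assuming the ~1e-14 errors/bit figure from consumer SATA spec sheets.
bits_read = 6 * 1000.205e9 * 8            # six surviving drives, read in full
expected_errors = bits_read * 1e-14       # ~0.48 expected UREs
p_failure = 1 - math.exp(-expected_errors)
print(f"P(rebuild hits a URE) ~ {p_failure:.0%}")   # ~38%

Which is also a good argument for the RAID6 idea at the end of this post.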
SR shelf 1> eject 1
SR shelf 1> make -r 1 raid5 1.0 1.2 1.4 1.6 1.8:f 1.10 1.12
no spare large enough for 1.0.4
SR shelf 1> list -l
1 6001.229GB offline
1.0 6001.229GB raid5 degraded
1.0.0 normal 1000.205GB 1.0
1.0.1 normal 1000.205GB 1.2
1.0.2 normal 1000.205GB 1.4
1.0.3 normal 1000.205GB 1.6
1.0.4 failed 1000.205GB 1.8
1.0.5 normal 1000.205GB 1.10
1.0.6 normal 1000.205GB 1.12
SR shelf 1> online 1
I have access to my disks again, at least as long as I don't try to read that failed block. Next step: copy all the information somewhere else and initialize a new RAID5 array. Or, better, a RAID6 one :-)
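For the copy step, the tool needs to tolerate read errors instead of aborting on that bad block: dd conv=noerror,sync or, better, GNU ddrescue. The same idea as a minimal Python sketch, assuming the LUN is visible from a Linux client as an AoE block device (the /dev/etherd/e1.1 path and the destination are assumptions, not taken from this setup):

import os

# Minimal sketch, not a ddrescue replacement: copy a block device,
# zero-filling any chunk that fails to read instead of giving up.
SRC = "/dev/etherd/e1.1"        # shelf 1, LUN 1 (hypothetical device name)
DST = "/mnt/backup/lun1.img"    # hypothetical destination
CHUNK = 1 << 20                 # read 1 MiB at a time

src = os.open(SRC, os.O_RDONLY)
with open(DST, "wb") as dst:
    offset = 0
    while True:
        os.lseek(src, offset, os.SEEK_SET)   # re-seek; position is unreliable after a failed read
        try:
            chunk = os.read(src, CHUNK)
        except OSError:
            print(f"read error at offset {offset}, zero-filling chunk")
            chunk = b"\x00" * CHUNK          # keep the image aligned, like dd conv=sync
        if not chunk:
            break                            # end of device
        dst.write(chunk)
        offset += len(chunk)
os.close(src)

In practice ddrescue is the better choice: it retries bad areas and keeps a log of them, so the copy can be resumed and refined.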