Labels: work, hardware, RAID

In our company, we have a backup solution that stores all its data on a set of disks configured as a RAID5 array and served by a Coraid SR1521 server.

Everything has always worked fine... until today. One of the disks in the RAID5 array failed, so the system replaced it with a spare and began recovering. But, apparently, another disk failed while the recovery was taking place:

SR shelf 1> list -l
 1 6001.229GB online
  1.0   6001.229GB raid5 failed
    1.0.0  normal   1000.205GB 1.0
    1.0.1  failed   1000.205GB 1.2
    1.0.2  normal   1000.205GB 1.4
    1.0.3  normal   1000.205GB 1.6
    1.0.4  replaced 1000.205GB 1.14
    1.0.5  normal   1000.205GB 1.10
    1.0.6  normal   1000.205GB 1.12

So, we had a RAID5 array missing two of its disks, and the array stopped working.
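RAID5 keeps a single XOR parity block per stripe, which is enough to rebuild one missing disk but not two. A purely illustrative Python sketch of that arithmetic (this is not Coraid code, and the block contents are made up):

def xor_blocks(blocks):
    # XOR the bytes of all blocks together, position by position.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # one stripe across three data disks
parity = xor_blocks(data)            # what the parity disk would store

# One disk lost: XOR of the survivors plus the parity rebuilds it.
assert xor_blocks([data[0], data[2], parity]) == data[1]

# Two disks lost: one parity equation, two unknowns; with a second
# stripe member gone, nothing can reconstruct the missing data.

Let's try to recover it anyway... first, I'll bring the logical device offline, so it cannot be accessed from the outside: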

SR shelf 1> offline 1
SR shelf 1> list -l
 1 6001.229GB offline
  1.0   6001.229GB raid5 failed
    1.0.0  normal   1000.205GB 1.0
    1.0.1  failed   1000.205GB 1.2
    1.0.2  normal   1000.205GB 1.4
    1.0.3  normal   1000.205GB 1.6
    1.0.4  replaced 1000.205GB 1.14
    1.0.5  normal   1000.205GB 1.10
    1.0.6  normal   1000.205GB 1.12

Since the array cannot run with two failed members, I will tell the system to ignore the failure of the second drive and see if it works:

SR shelf 1> unfail 1.0.1
error: raid 1.0 is failed

No, it doesn't... at least, not while the disk is part of a failed RAID. Well, then I will re-create the array, telling the system that the disk is OK:

SR shelf 1> restore -l
Reading config information from drives ... done.
make -r 1 raid5 1.0 1.2:f 1.4 1.6 1.14:r 1.10 1.12
online 1
SR shelf 1> eject 1
Ejecting lblade(s): 1
SR shelf 1> list -l
SR shelf 1> make -r 1 raid5 1.0 1.2 1.4 1.6 1.14:r 1.10 1.12
beginning recovery of disk 1.0.4
SR shelf 1> list -l
 1 6001.229GB offline
  1.0   6001.229GB raid5 recovering,degraded  0.03%
    1.0.0  normal   1000.205GB 1.0
    1.0.1  normal   1000.205GB 1.2
    1.0.2  normal   1000.205GB 1.4
    1.0.3  normal   1000.205GB 1.6
    1.0.4  replaced 1000.205GB 1.14
    1.0.5  normal   1000.205GB 1.10
    1.0.6  normal   1000.205GB 1.12

Well, apparently it worked! Re-creating the array without the :f flag on drive 1.2 made the system treat it as healthy again, so the RAID is recovering and the replaced disk is being filled. Until...:

aborted recovery of disk 1.0.4
raidshield: unrecoverable error on disk 1.0.1 at block 6496009
raid device 1.0.1 has failed
unrecoverable failure on raid 1.0
SR shelf 1> list -l
 1 6001.229GB offline
  1.0   6001.229GB raid5 failed
    1.0.0  normal   1000.205GB 1.0
    1.0.1  failed   1000.205GB 1.2
    1.0.2  normal   1000.205GB 1.4
    1.0.3  normal   1000.205GB 1.6
    1.0.4  replaced 1000.205GB 1.14
    1.0.5  normal   1000.205GB 1.10
    1.0.6  normal   1000.205GB 1.12

The same disk failed again, at block 6496009, while the server was reading it to recalculate parity for the rebuild. But the good news is that it didn't fail until the recovery was more than 50% complete, and all the information I care about is in the first 20% of each disk. So, I should be able to recover it... somehow :)
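
Why does one bad sector kill an entire rebuild? Here is a toy sketch of the logic, assuming made-up sector contents (this models the behaviour seen in the log, not Coraid's actual code):

BAD = ("1.0.1", 6496009)   # the failure reported in the log above

def read_block(disk, n):
    # Placeholder reads: every block is fine except the one bad sector.
    if (disk, n) == BAD:
        raise IOError("unrecoverable error on disk %s at block %d" % (disk, n))
    return b"\0" * 512

def rebuild(survivors, total_blocks):
    # The replacement disk is reconstructed stripe by stripe from ALL
    # surviving members, so one bad block on any of them aborts the run.
    for n in range(total_blocks):
        stripe = [read_block(disk, n) for disk in survivors]
        # ...the XOR of `stripe` would be written to the new disk here.

try:
    rebuild(["1.0.0", "1.0.1", "1.0.2", "1.0.3", "1.0.5", "1.0.6"], 6500000)
except IOError as e:
    print("aborted recovery:", e)   # matches the console message above

Knowing that, let's re-create the array one more time, now with position 4 simply marked as failed (the :f flag), so the array comes up degraded and no rebuild that touches the bad block is attempted: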

SR shelf 1> eject 1
SR shelf 1> make -r 1 raid5 1.0 1.2 1.4 1.6 1.8:f 1.10 1.12
no spare large enough for 1.0.4
SR shelf 1> list -l
 1 6001.229GB offline
  1.0   6001.229GB raid5 degraded
    1.0.0  normal   1000.205GB 1.0
    1.0.1  normal   1000.205GB 1.2
    1.0.2  normal   1000.205GB 1.4
    1.0.3  normal   1000.205GB 1.6
    1.0.4  failed   1000.205GB 1.8
    1.0.5  normal   1000.205GB 1.10
    1.0.6  normal   1000.205GB 1.12
SR shelf 1> online 1

I have access to my disks again, at least as long as nothing tries to read that failed block. Next step: copy all the information somewhere else and initialize a new RAID5 array. Or better, a RAID6 one, which keeps two parity blocks per stripe and so survives two simultaneous disk failures :-)
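
For the copy itself, something like GNU ddrescue is the right tool, since it keeps going past unreadable sectors. Just to illustrate the idea, here is a minimal Python sketch of the same approach; the AoE device name is an assumption about how the Coraid LUN shows up on a Linux client, not something taken from the logs above:

import os

SRC = "/dev/etherd/e1.1"    # assumed AoE device for shelf 1, LUN 1
DST = "/srv/backup/lun1.img"
CHUNK = 1 << 20             # read 1 MiB at a time

src = os.open(SRC, os.O_RDONLY)
dst = os.open(DST, os.O_WRONLY | os.O_CREAT, 0o600)
size = os.lseek(src, 0, os.SEEK_END)

offset = 0
while offset < size:
    n = min(CHUNK, size - offset)
    try:
        data = os.pread(src, n, offset)
    except OSError:
        # An unreadable region (like the bad block on 1.0.1): fill with
        # zeros and keep copying instead of aborting the whole transfer.
        data = b"\0" * n
    os.pwrite(dst, data, offset)
    offset += n

os.close(src)
os.close(dst)

A real tool like ddrescue additionally retries and bisects around the unreadable area, so it would recover more data than this naive zero-filling loop.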