Coraid SAN RAID5 disks failed

Labels: work hardware RAID

In our company, we have a backup solution which stores all the information on a set of disks, configured as a RAID5 array and served by a Coraid SR1521 server.

Everything has always worked fine... until today. One of the disks in the RAID5 array failed, so the system decided to replace it with a spare one, and began recovering. But, apparently, another one failed too while the recovery was taking place:

SR shelf 1> list -l
 1 6001.229GB online
  1.0   6001.229GB raid5 failed
    1.0.0  normal   1000.205GB 1.0
    1.0.1  failed   1000.205GB 1.2
    1.0.2  normal   1000.205GB 1.4
    1.0.3  normal   1000.205GB 1.6
    1.0.4  replaced 1000.205GB 1.14
    1.0.5  normal   1000.205GB 1.10
    1.0.6  normal   1000.205GB 1.12
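
As an aside, the sizes in this listing follow the usual parity-RAID arithmetic: seven members of 1000.205GB each, with one member's worth of space used for parity. A minimal sketch of that calculation, assuming the displayed sizes are rounded (the function name and its parameters are made up for this illustration):

```python
def usable_capacity_gb(members: int, member_size_gb: float, parity_members: int = 1) -> float:
    """Usable size of a parity RAID set: (members - parity_members) * member size."""
    if members <= parity_members:
        raise ValueError("need more members than parity members")
    return (members - parity_members) * member_size_gb


# Seven 1000.205GB members, as in the listing above.
raid5 = usable_capacity_gb(7, 1000.205)                     # one member's worth of parity
raid6 = usable_capacity_gb(7, 1000.205, parity_members=2)   # two members' worth of parity
print(f"one parity member: {raid5:.2f} GB, two parity members: {raid6:.2f} GB")
```

With one parity member this gives about 6001.23GB, which agrees with the 6001.229GB reported by the shelf up to rounding. With two parity members (as in RAID6) the same disks would give roughly 1000GB less.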

So, we had a RAID5 array with two of its disks missing, and the array stopped working. Let's try to recover it... first, I'll take the LUN offline, so it cannot be accessed from the outside:

SR shelf 1> offline 1
SR shelf 1> list -l
 1 6001.229GB offline
  1.0   6001.229GB raid5 failed
    1.0.0  normal   1000.205GB 1.0
    1.0.1  failed   1000.205GB 1.2
    1.0.2  normal   1000.205GB 1.4
    1.0.3  normal   1000.205GB 1.6
    1.0.4  replaced 1000.205GB 1.14
    1.0.5  normal   1000.205GB 1.10
    1.0.6  normal   1000.205GB 1.12
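
Why two missing members are fatal can be shown with a toy model of RAID5 parity. This is only a sketch: it assumes a single stripe with plain XOR parity on the last member, whereas a real array rotates parity across the disks, and the helper name `xor_blocks` is invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy stripe: six data blocks plus one parity block (seven members in total).
data = [bytes([value] * 4) for value in (11, 22, 33, 44, 55, 66)]
parity = xor_blocks(data)
stripe = data + [parity]

# One member lost: XOR-ing the six survivors gives back the missing block.
lost = 1
survivors = [blk for i, blk in enumerate(stripe) if i != lost]
rebuilt = xor_blocks(survivors)
print("one lost, rebuilt correctly:", rebuilt == stripe[lost])

# Two members lost: the survivors only yield the XOR of the two missing
# blocks, which is not enough to tell what either of them contained.
lost_a, lost_b = 1, 4
survivors = [blk for i, blk in enumerate(stripe) if i not in (lost_a, lost_b)]
combined = xor_blocks(survivors)
print("two lost, only their XOR is known:",
      combined == xor_blocks([stripe[lost_a], stripe[lost_b]]))
```

That is the situation here: one member failed and its replacement is not yet rebuilt, so the array has nothing left to reconstruct from.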

A RAID5 array cannot work without two of its disks, so I will tell the system to ignore the failure of the second drive, and see if it works:

SR shelf 1> unfail 1.0.1
error: raid 1.0 is failed

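No, it doesn't... at least, not while the disk is part of a failed RAID. Well, I will regenerate the array, telling the system that the disk is OK. As far as I can tell, "restore -l" prints the "make" command that would recreate the array from the configuration stored on the drives; I eject the LUN and run that command again, without the ":f" (failed) flag on 1.2: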

SR shelf 1> restore -l
Reading config information from drives ... done.
make -r 1 raid5 1.0 1.2:f 1.4 1.6 1.14:r 1.10 1.12
online 1
SR shelf 1> eject 1
Ejecting lbalde(s): 1
SR shelf 1> list -l
SR shelf 1> make -r 1 raid5 1.0 1.2 1.4 1.6 1.14:r 1.10 1.12
beginning recovery of disk 1.0.4
SR shelf 1> list -l
 1 6001.229GB offline
  1.0   6001.229GB raid5 recovering,degraded  0.03%
    1.0.0  normal   1000.205GB 1.0
    1.0.1  normal   1000.205GB 1.2
    1.0.2  normal   1000.205GB 1.4
    1.0.3  normal   1000.205GB 1.6
    1.0.4  replaced 1000.205GB 1.14
    1.0.5  normal   1000.205GB 1.10
    1.0.6  normal   1000.205GB 1.12

Well, apparently it is working! The RAID is recovering, and the first replaced disk is being filled again. Until...:

aborted recovery of disk 1.0.4
raidshield: unrecoverable error on disk 1.0.1 at block 6496009
raid device 1.0.1 has failed
unrecoverable failure on raid 1.0
SR shelf 1> list -l
 1 6001.229GB offline
  1.0   6001.229GB raid5 failed
    1.0.0  normal   1000.205GB 1.0
    1.0.1  failed   1000.205GB 1.2
    1.0.2  normal   1000.205GB 1.4
    1.0.3  normal   1000.205GB 1.6
    1.0.4  replaced 1000.205GB 1.14
    1.0.5  normal   1000.205GB 1.10
    1.0.6  normal   1000.205GB 1.12

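The same disk failed again, at block 6496009, while the server was recovering and calculating the disks' parity. But the good news is that it didn't fail until the recovery was more than 50% complete, and all the information I care about is in the first 20% of all the disks. So, I should be able to recover it... somehow :) This time I recreate the array with 1.2 marked as OK again, but with a failed member (1.8:f) in the slot of the replaced disk, so that no recovery is started and the array stays degraded: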

SR shelf 1> eject 1
SR shelf 1> make -r 1 raid5 1.0 1.2 1.4 1.6 1.8:f 1.10 1.12
no spare large enough for 1.0.4
SR shelf 1> list -l
 1 6001.229GB offline
  1.0   6001.229GB raid5 degraded
    1.0.0  normal   1000.205GB 1.0
    1.0.1  normal   1000.205GB 1.2
    1.0.2  normal   1000.205GB 1.4
    1.0.3  normal   1000.205GB 1.6
    1.0.4  failed   1000.205GB 1.8
    1.0.5  normal   1000.205GB 1.10
    1.0.6  normal   1000.205GB 1.12
SR shelf 1> online 1

I have access to my disks again, at least as long as I do not try to read that failed block. Next step: copy all the information somewhere else, and initialize a new RAID5 array. Or, better, a RAID6 one :-)
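
For the copy itself, the idea is to read the volume block by block and skip whatever cannot be read, instead of aborting at the first error. The following is a minimal sketch of that idea, not what I actually ran: `rescue_copy` and `FlakySource` are hypothetical names, the source is simulated in memory, and dedicated tools such as GNU ddrescue handle retries and logging far more carefully.

```python
import io


def rescue_copy(src, dst, total_size, block_size=64 * 1024):
    """Copy total_size bytes from src to dst, zero-filling unreadable parts.

    src and dst are seekable binary file objects. Returns a list of
    (offset, length) ranges that could not be read.
    """
    bad_ranges = []
    offset = 0
    while offset < total_size:
        length = min(block_size, total_size - offset)
        try:
            src.seek(offset)
            chunk = src.read(length)
        except OSError:
            chunk = b""  # nothing usable from this block
        if len(chunk) < length:
            # Record the missing part and pad with zeros, so everything
            # after it keeps its original offset in the copy.
            bad_ranges.append((offset + len(chunk), length - len(chunk)))
            chunk = chunk.ljust(length, b"\0")
        dst.seek(offset)
        dst.write(chunk)
        offset += length
    return bad_ranges


class FlakySource(io.BytesIO):
    """Hypothetical stand-in for a device with one unreadable region."""

    def __init__(self, payload, bad_start, bad_end):
        super().__init__(payload)
        self.bad_start, self.bad_end = bad_start, bad_end

    def read(self, size=-1):
        start = self.tell()
        if start < self.bad_end and start + size > self.bad_start:
            raise OSError("simulated unreadable block")
        return super().read(size)


payload = bytes(range(256)) * 4  # 1024 bytes of test data
source = FlakySource(payload, bad_start=512, bad_end=640)
target = io.BytesIO()
bad = rescue_copy(source, target, total_size=len(payload), block_size=128)
print("unreadable ranges:", bad)
```

Keeping the list of unreadable ranges matters: it tells you afterwards which files, if any, sit on the zero-filled parts of the copy.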

Pedro, November 6, 2011

Excuse me, where do these instructions have to be entered if the system is inaccessible because of the RAID failure?

Juan Céspedes

Programmer, Caver, Geocacher, Hacker
