Linux Server Survives Total RAID Failure
November 7th, 2007 by James Hicks
It looks like a perfectly normal day, until I hear the screaming RAID failure alarm blaring out of the server room as I walk past. My pace picks up considerably…
“Hey, who’s screaming in the server room?” I shout as I get to our office area.
“Ah that’s Strawberry - Joan thinks its RAID controller died so she’s building a replacement from backups now.”
My brain begins to race. Joan thinks the RAID controller is dead? It’s dead. I’ve worked with Joan for over a year and I don’t bother questioning her diagnosis anymore - I can safely assume she’s correct. She’s rebuilding it from backups? Ok fine - I can’t make that go faster than the tape drive, leave her to it. How can I help her? Log in, see if I can retrieve anything.
Cos here’s the twist, folks. A Linux server can totally lose all access to its storage - and keep going. Today, though, even I am in for some surprises.
I log in. Cool - it’s still running. The kernel’s obviously still going and the network’s fine. Wait a minute…
“HEY, if the network’s fine on this thing, what’s it NOT doing? Aren’t we just using it as a router?”
“Yeah but it’s doing DHCP and that’s crashed.”
Crap. Ah well, what can I get?
I try an ls on /etc/… no good. The system can’t load the binary ‘ls’. No directory listings for me. Ok, well can I edit the dhcp server configuration file?
#vi /etc/dhcpd.conf
No good - the system times out trying to load the ‘vi’ editor from the disk. Ok… what can I do?
#cat /etc/dhcpd.conf
Bingo! Cat doesn’t need to be loaded from the disk. Why? It’s called during nightly automatic jobs on this machine, and so it’s still cached in memory from the last time it was loaded! Better yet, Linux is clever enough not to even _try_ to get to the dead disks if the program it’s trying to load is in memory!
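A related trick for the failed ls above: anything built into the shell itself never has to be read from disk. The sketch below is not what was done on the day; it assumes a Bourne-style shell is already running, and ls_builtin and cat_builtin are made-up names for illustration. Reading a file this way still only works if the file's contents are cached in memory.

```shell
#!/bin/sh
# List a directory using only shell builtins (no external 'ls' binary).
ls_builtin() {
    local dir=${1:-.} entry
    for entry in "$dir"/*; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        printf '%s\n' "${entry##*/}"
    done
}

# Print a file using only builtins (no external 'cat' binary).
cat_builtin() {
    local line
    # The extra test keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}
```

For example, `ls_builtin /etc` followed by `cat_builtin /etc/dhcpd.conf`. Hidden files are skipped because the `*` glob does not match names starting with a dot.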
I copy the dhcpd.conf file into an email, along with the network and routing config, and the WAN link configuration and send it to Joan. That ought to speed things up a bit for her!
Hmm, what else can we do?
Can I restart the dhcp server?
#service dhcpd start
Cannot open leases file /var/lib/dhcp/dhcpd.leases
Well, that makes sense. The DHCP server remembers what IP address it’s given out to whom in that file. It can’t open the file though, because it’s on the dead disk, so it won’t load.
What to do?
I take a break and have a coffee. There has to be a way around this.

Suddenly, I get a flash of inspiration!
If I could get DHCP to put its leases file in /dev/shm (which exists in RAM, not on the disk!) it could run, and we could have this server doing everything it used to until Joan’s replacement server is ready!
But how? I can’t edit the config file. Even if I could load an editor (put one on a floppy disk?!) I couldn’t EDIT the file - it’s on the dead disk and can’t be written to!
Mount! I could mount it!
#mount -t tmpfs /dev/shm /var/lib/dhcp/
Cannot open /etc/mtab
Damn, this game is just not fun. I check the man page for mount - I can mount without changing the /etc/mtab file…
#mount -n -t tmpfs /dev/shm /var/lib/dhcp/
No output comes back… it worked!
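The trick generalises: cover any directory on the dead disk with a fresh, empty RAM-backed filesystem. Here is a minimal sketch, assuming root and a kernel with tmpfs support. The function name overlay_in_ram and the RUN dry-run variable are made up for illustration. For tmpfs the "device" argument is only a label, so /dev/shm in the command above and tmpfs below behave the same; either way the new mount starts empty and does not share the contents of /dev/shm.

```shell
#!/bin/sh
# RUN is a hypothetical dry-run hook: left unset, commands are only printed.
# Set RUN="" (as root) to actually execute them.
RUN=${RUN-echo}

overlay_in_ram() {
    # -n        do not write /etc/mtab (it lives on the unreachable disk)
    # -t tmpfs  RAM-backed filesystem; the source field is just a label
    $RUN mount -n -t tmpfs tmpfs "$1"
}
```

Calling `overlay_in_ram /var/lib/dhcp` in dry-run mode prints the mount command instead of running it, which is a cheap way to check what is about to happen on a box that is already limping.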
#touch /var/lib/dhcp/dhcpd.leases
#service dhcpd start
cannot open /var/run/dhcpd.pid
Bah!
#mount -n -t tmpfs /dev/shm /var/run/
#service dhcpd start
Starting dhcpd: [ OK ]
YES!!!
“Hey, uh, I think I got dhcpd going - can we check with the VIPs and see if they can log in now?”
Yes, they could. Later that day, we replaced the broken old server with a brand new one. Instead of hours of outage and a rushed replacement, we had a very short outage and a well set up new server.
That’s one thing I love about Linux. It’s so robust; it generally survives whatever it’s theoretically possible to survive, and it gives you the flexibility you need to do things like… mount random directories in RAM instead of on the dead disk they USED to live on. You don’t get your data back, but you can make a new file.
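One side effect of mount -n is worth knowing: on a system where /etc/mtab is a regular file, tools that read it will not list mounts made this way, but the kernel's own table in /proc/mounts will. The sketch below checks what is mounted at a directory using only the shell's builtin read. The function name fstype_at and the MOUNTS_FILE override are made up for illustration.

```shell
#!/bin/sh
# Print the filesystem type mounted exactly at a directory.
# MOUNTS_FILE is a hypothetical override for testing; the default is the
# kernel's live mount table, which needs no disk access.
fstype_at() {
    target=$1
    found=""
    while read -r dev dir type rest; do
        # Later entries shadow earlier ones, so keep the last match.
        if [ "$dir" = "$target" ]; then
            found=$type
        fi
    done < "${MOUNTS_FILE:-/proc/mounts}"
    if [ -n "$found" ]; then
        printf '%s\n' "$found"
    else
        return 1
    fi
}
```

Pass the path without a trailing slash, for example `fstype_at /var/lib/dhcp`, since that is how the kernel records mount points.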
The above is a true story. It happened to me on the morning of August 16, 2004. I know, because I kept a copy of the incident report. Because I am a Nerd.
Posted in Systems Administration
9 Responses
February 8th, 2008 at 2:35 am
Linux <3, but what the (…) were you doing using raid on a (…)ing router/dhcpserver? maybe a raid0 but that’s just useless (imho)
February 9th, 2008 at 1:54 am
A better question might be why we were using a server for a router/dhcp server.
As to RAID… if it’s a live system that people depend on, you RAID it (1, 5 or 10; yes, 0 is useless).
March 11th, 2008 at 1:41 am
Great story!
I’m curious though; how does your system handle user authentication? /etc/passwd must have been irretrievable due to the RAID failure?
March 11th, 2008 at 2:14 am
Hi Sarah,
It was a fairly standard box. /etc/passwd and /etc/shadow must have been cached since the last time somebody logged in. Probably /etc/group and a bunch of other stuff as well.
June 5th, 2009 at 12:44 pm
mmmm…. why not just replace the raid controller? that would have been the fastest… also, you make this sound very dramatic when it’s really one of the smallest problems you can run into, and don’t you have a cluster of servers, so in the event that one goes down the others can take up the slack…
June 5th, 2009 at 8:21 pm
Hey person435.
Replace the raid controller fast? Sure, if you have one spare. If not, enjoy that two week supplier delay, or at best the hours waiting for your support to turn up with a spare - assuming they’ve the brains to bring one.
Nothing is faster than continuing with a broken part.
Nowadays where I work we have many clusters… at the time this article is written about, the site I worked at had very few services clustered.
June 6th, 2009 at 12:34 am
I don’t know if you were the “owner” of the lab, but shouldn’t you make sure you have redundancies? I mean I bet you had a box full of replacement drives in the event that one died or several. I think keeping stock is important and having replacement raid controllers is sysadmin 101. Good job at fixing the problem.. But a bad job on whoever was responsible for keeping operation critical components in stock.
June 17th, 2010 at 3:07 am
RAID redundancies maybe, but the controller has failed… unless you mean multiple servers, which would be preferable.
June 17th, 2010 at 10:01 pm
eh?