Linux Server Survives Total RAID Failure

November 7th, 2007 by James Hicks

It looks like a perfectly normal day, until I hear the screaming RAID failure alarm blaring out of the server room as I walk past. My pace picks up considerably…

“Hey, who’s screaming in the server room?” I shout as I get to our office area.

“Ah that’s Strawberry - Joan thinks its RAID controller died so she’s building a replacement from backups now.”

My brain begins to race. Joan thinks the RAID controller is dead? It’s dead. I’ve worked with Joan for over a year and I don’t bother questioning her diagnosis anymore - I can safely assume she’s correct. She’s rebuilding it from backups? Ok fine - I can’t make that go faster than the tape drive, leave her to it. How can I help her? Log in, see if I can retrieve anything.

Cos here’s the twist, folks: a Linux server can totally lose all access to its storage - and keep going. Today, though, even I am in for some surprises.

I log in. Cool - it’s still running. The kernel’s obviously still going and the network’s fine. Wait a minute…

“HEY, if the network’s fine on this thing, what’s it NOT doing? Aren’t we just using it as a router?”

“Yeah but it’s doing DHCP and that’s crashed.”

Crap. Ah well, what can I get?

I try an ls on /etc/… no good. The system can’t load the binary ‘ls’. No directory listings for me. Ok, well can I edit the dhcp server configuration file?

#vi /etc/dhcpd.conf

No good - the system times out trying to load the ‘vi’ editor from the disk. Ok… what can I do?

#cat /etc/dhcpd.conf

Bingo! Cat doesn’t need to be loaded from the disk. Why? It’s called during nightly automatic jobs on this machine, so the binary is still cached in memory from the last time it ran. The config file is readable for the same reason: dhcpd read it when it started, so its contents are presumably still cached too. Better yet, Linux is clever enough not to even _try_ to get to the dead disks if what it needs is already in memory!
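A side note on the failed ls: the login shell is itself already resident in memory, so its builtins keep working even when no new binary can be loaded. The following is a minimal sketch, assuming bash, of stand-ins for ls and cat built only from builtins (globbing, read, printf). The names bls and bcat are made up for illustration and are not part of the original incident. Reading a file this way still only works if its contents are cached.

```shell
# Hypothetical helpers; the names bls and bcat are illustrative only.

# List the entries of a directory using only globbing and builtins.
# Hidden (dot) files are not matched by the * glob and are skipped.
bls() {
    local dir="${1:-.}" entry
    for entry in "$dir"/*; do
        # An unmatched glob stays literal; skip it.
        [ -e "$entry" ] || continue
        printf '%s\n' "${entry##*/}"
    done
}

# Print a file line by line using only the read and printf builtins.
bcat() {
    local line
    # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}
```

Usage would look like `bls /etc` or `bcat /etc/dhcpd.conf`.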

I copy the dhcpd.conf file into an email, along with the network and routing config, and the WAN link configuration and send it to Joan. That ought to speed things up a bit for her!

Hmm, what else can we do?

Can I restart the dhcp server?

#service dhcpd start

Cannot open leases file /var/lib/dhcp/dhcpd.leases

Well, that makes sense. The DHCP server remembers what IP address it’s given out to whom in that file. It can’t open the file though, because it’s on the dead disk, so it won’t start :(

What to do?

I take a break and have a coffee. There has to be a way around this.

Suddenly, I get a flash of inspiration!

If I could get DHCP to put its leases file in /dev/shm (which exists in RAM, not on the disk!) it could run, and we could have this server doing everything it used to until Joan’s replacement server is ready!

But how? I can’t edit the config file. Even if I could load an editor (put one on a floppy disk?!) I couldn’t EDIT the file - it’s on the dead disk and can’t be written to!

Mount! I could mount it!

#mount -t tmpfs /dev/shm /var/lib/dhcp/

Cannot open /etc/mtab

Damn, this game is just not fun. I check the man page for mount - with the -n option I can mount without writing to the /etc/mtab file…

#mount -n -t tmpfs /dev/shm /var/lib/dhcp/

No output comes back… it worked!
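Because -n skips /etc/mtab, tools that read mtab will not list the new mount. The kernel's own table in /proc/mounts still will. Below is a minimal sketch, assuming bash, of a check that uses only shell builtins. The function name is_tmpfs_mount is hypothetical. Note that for tmpfs the device argument is just a label, so the command above creates a fresh, empty tmpfs on the directory rather than reusing the contents of /dev/shm.

```shell
# Hypothetical helper: succeed if the directory is a tmpfs mount point.
# $1: mount point to look for
# $2: mounts table to read (defaults to the kernel's /proc/mounts)
is_tmpfs_mount() {
    local target="${1%/}" table="${2:-/proc/mounts}"
    local dev mnt fstype rest
    # Stripping the trailing slash turns "/" into "", so restore it.
    [ -n "$target" ] || target=/
    # Each line reads: device mountpoint fstype options dump pass
    while read -r dev mnt fstype rest; do
        if [ "$mnt" = "$target" ] && [ "$fstype" = "tmpfs" ]; then
            return 0
        fi
    done < "$table"
    return 1
}
```

For example, `is_tmpfs_mount /var/lib/dhcp/ && echo mounted` would confirm the mount from the story.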

#touch /var/lib/dhcp/dhcpd.leases

#service dhcpd start

cannot open /var/run/dhcpd.pid

Bah!

#mount -n -t tmpfs /dev/shm /var/run/

#service dhcpd start
Starting dhcpd: [ OK ]

YES!!!

“Hey, uh, I think I got dhcpd going - can we check with the VIPs and see if they can log in now?”

Yes, they could. Later that day, we replaced the broken old server with a brand new one. Instead of hours of outage and a rushed replacement, we had a very short outage and a well set up new server.

That’s one thing I love about Linux. It’s so robust; it generally survives whatever it’s theoretically possible to survive, and it gives you the flexibility you need to do things like… mount random directories in RAM instead of on the dead disk they USED to live on. You don’t get your data back, but you can make a new file.
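The trick generalizes to any directory a daemon needs to write to. Here is a minimal sketch, assuming bash, of a helper that covers each given directory with an empty tmpfs. The name overlay_tmpfs and the DRY_RUN variable are made up for illustration. By default it only prints the commands it would run, since a real mount hides whatever was in the directory before and needs root.

```shell
# Hypothetical helper: cover each given directory with an empty tmpfs.
# With DRY_RUN=1 (the default in this sketch) it only prints the commands.
# Set DRY_RUN=0 to actually mount; that requires root.
overlay_tmpfs() {
    local dir
    for dir in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            printf 'mount -n -t tmpfs tmpfs %s\n' "$dir"
        else
            # -n avoids writing /etc/mtab, which may be on the dead disk.
            mount -n -t tmpfs tmpfs "$dir" || return 1
        fi
    done
}
```

Running `overlay_tmpfs /var/lib/dhcp /var/run` prints the two mount commands from the story, with the conventional `tmpfs` label in place of `/dev/shm`.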

The above is a true story. It happened to me on the morning of August 16, 2004. I know, because I kept a copy of the incident report. Because I am a Nerd.

Posted in Systems Administration

9 Responses

  1. bart Says:

    Linux <3, but what the (…) were you doing using raid on a (…)ing router/dhcpserver? maybe a raid0 but that’s just useless (imho)

  2. James Hicks Says:

    A better question might be why we were using a server for a router/dhcp server.

    As to RAID… if it’s a live system that people depend on, you RAID it (1, 5 or 10, yes 0 is useless)

  3. Sarah Says:

    Great story!

    I’m curious though; how does your system handle user authentication? /etc/passwd must have been irretrievable due to the RAID failure?

  4. James Hicks Says:

    Hi Sarah,

    It was a fairly standard box. /etc/passwd and /etc/shadow must have been cached since the last time somebody logged in. Probably /etc/group and a bunch of other stuff as well. :)

  5. person435 Says:

    mmmm…. why not just replace the raid controller? that would have been the fastest… also, you make this sound very dramatic when it’s really one of the smallest problems you can run into, and don’t you have a cluster of servers, so that in the event that one goes down the others can take up the slack…

  6. James Hicks Says:

    Hey person435.

    Replace the raid controller fast? Sure, if you have one spare. If not, enjoy that two week supplier delay, or at best the hours waiting for your support to turn up with a spare - assuming they’ve the brains to bring one.

    Nothing is faster than continuing with a broken part.

    Nowadays where I work we have many clusters… at the time this article is written about, the site I worked at had very few services clustered.

  7. person435 Says:

    I don’t know if you were the “owner” of the lab, but shouldn’t you make sure you have redundancies? I mean I bet you had a box full of replacement drives in the event that one died or several. I think keeping stock is important and having replacement raid controllers is sysadmin 101. Good job at fixing the problem.. But a bad job on whoever was responsible for keeping operation critical components in stock.

  8. pop3 Says:

    RAID redundancies maybe, but the controller has failed… unless you mean multiple servers, which would be preferable.

  9. James Hicks Says:

    eh?
