Faster Than Grep
February 26th, 2008 by James HicksYou know the guano is hitting the fan when your boss’s boss comes up to your desk and says: “What’s faster than grep?”
It was looking like a boring day. I got in late, almost 11 (B-A-D, I usually work 10:30-18:30), and by 11:30 had nailed all the daily maintenance stuff and was looking at a series of deadlocked and waiting-for-other-guys jobs. And a few epic jobs that couldn’t really be furthered today. Then the ops manager walked up to me…
– Today’s post is another SysAdmin ‘war story’, this time from 2006 --
“What’s faster than grep?”
If you aren’t a geek, grep is a really versatile and REALLY GOD-DAMN FAST thing you search text files with. If somebody important wants something faster than grep, they are either in an unholy hurry, or have a truckload of data.
Or worst case, like this case, both.
Fourty-four gig of data all told, a 180k list of some 13,000 records to pull out. Variable-length text data, a worst-case scenario. And we need the information extracted by tomorrow morning. Oh and you need to match two different fields.
Great.
The good news? The first field has a lot of duplicates. It’s 13,000 records, but only 220 unique in the first field. I flag that as “interesting”.
The problem with a looped grep is, you’re spinning the Big Wheel fast and the little wheel slow. You’re cycling through 44gig of data 13,000 times. Nothing’s going to be in RAM (this is 2006, and I still don’t have a box with 64GB of RAM dammit.. I stuck one in at my last job with 32 though). So I come at it from the other direction. A repeated grep command was literally taking days (it was tried before they handed me the problem)
I’ve written something faster than grep before. But I cheated then, and indexed the data using something called CDB, which is the fastest database system I’ve ever seen (look it up, it is truly annihilative). But today I had no time to index anything….. one pass was going to have to be enough. How’d I do it?
I wrote it in C. No kidding, that volume of data, I need the fastest _performing_ tool I can find, or it’s gonna take weeks. Compile with -O3! PHP would take way less time to code, saving an hour or two even, but take days longer to run.
First thing I did was load the 13,000 sets of two fields into an array. This meant that the data doing most of the work was stuck in RAM, and we only had to cycle through the 44GB of junk once. This is a good thing.
Lots of bugs ensued. The usual C stuff… you hack something together and it segfaults the first 100 times you run it, while you madly run around putting print statements everywhere trying to figure out where the fire is. Meanwhile, the clock is ticking and my boss’s boss is sitting next to me or pacing around his office looking worried. Tick tock.
I get the thing working, and move it over the HPUX box that has all the data on it. I didn’t even have vim on that box, so I coded it first up on my linux desktop. I start it up there on a small sample of the data that had been working on my desktop.
It crashes. And crashes. And crashes. I run into bug after goddamn bug in the awful HP libraries. sscanf doesn’t work anything like it does under gnu libc. I’m about ready to move the data off onto a heavy linux server when my Boss (not my Boss’s Boss) mentions that he’s installed the GNU compiler and libraries on the hpux machine. Sweet! Compiles first go, runs first go now.
Except it’s too slow.
“What cpu’s in this thing man?”
“Uhhh I think it’s 4x 360Mhz”
“$%#%!”
I start transferring the data onto the aforementioned heavy linux server. It has about 10 times the processor power. I have a brief discussion with IP Engineering about LETTING ME THROUGH THE FIREWALL NOW PLZ. Then it starts….. 5MB a sec anyone? This is the kind of thing, I mention, it would be USEFUL to UPGRADE TO GIGABIT ETHERNET for. Tick tock.
Hours later…. The hours are good in a way, they give me time to do a bit of other work and think about how to make my hacked-together program go faster.
The first thing is, I realise, smacking myself in the head, that I’m matching stuff and then continuing to compare the rest of the 13,000 wanted records to the line after it’s been matched! I fix this, and then I remember the “interesting” thing from earlier.
220 unique records in the first field only. I write a loop (god I hate C, this would be another one-liner in php :() to find these and put them into an array. I use this array to “screen” each record before scanning it. If the first field isn’t in the 220, the line doesn’t get compared to the 13,000 - we skip it and move on.
This actually slows down my sample data noticeably. But I know that my sample data has an unusually high number of matches - I figure my program is just doing an extra 220 comparisons on each line for that data, whereas the bulk of the REAL data will be skipped over quickly by this code. How do I test that theory? I’m able to run the program on the incomplete file as it downloads. Don’t try this on Windows
I’m getting what looks like good results. It’s fast, and it’s Getting the Data.
I’m running out of hours. The first half of the data (glad it’s in two separate files now!) is finished downloading and I get to work on it for real. Looks good… it’s matching stuff, and the counter I’ve got in the program that prints a line when it gets to a million is clipping past fairly quickly. The dead spots in the data without any records in the 220 list fly past, and I estimate it’s going to take just 90 minutes to get through the first file! 285 million records in 90 minutes is just fine by me.
WE ARE GOING TO WIN.
The rest is easy. A few hours later, the second file finishes downloading. I start it up and walk home. It’s done by the time I get there and VPN in. Sweet. Home by 9pm, and the work done!
Today reminded me why I love my job.
del.icIo.us
|
Posted in Systems Administration |
30 Responses
Leave a Comment
February 26th, 2008 at 5:06 am
what game is the picture in your banner taken from?
February 26th, 2008 at 6:44 am
Really nice read man. I’ve read a few stories similar to this through SU and I think they’re great.
February 26th, 2008 at 9:29 am
Brilliant! I loved every minute of reading it. Would be good if you could post snippets from your code too…
…bit of shame that the first muppet who posted wants to know the picture in your banner..
February 26th, 2008 at 10:43 am
I love it when I pull off hacks like that. Gives me a bit of a God-complex
I hope your boss’ boss actually thanked you! It’s rare for us IT guys to get the thanks deserved…
February 26th, 2008 at 2:16 pm
Absolutely epic. Kind of like raising a barn in one day, all by yourself.
Do you think you could have gotten close performance in lisp? I bet coding it would have been a lot easier.
February 26th, 2008 at 2:17 pm
dude, i feel like such a dork for reading sysadmin blogs like this; but it was a good hacker story. kudos to you! fuck C! the real trick is to allocate the 3 days that php is gonna take to do this
February 26th, 2008 at 3:06 pm
Vanguard:SOH i think
February 26th, 2008 at 7:03 pm
I’m just getting started into the wold of code having a design and photography background. I thought this was a great read. Thanks for sharing.
-J
February 26th, 2008 at 8:11 pm
Thanks for the great feedback guys!
The picture in the banner is from Vanguard, Saga of Heroes as Garth suggested.
For the record yes indeed my Boss and his Boss were all very pleased with how things went - as were the higher-ups who needed the data!
Lisp performance: No idea
Feeling like a dork for reading sysadmin blogs? Surely not! Nerd yes, dork no
Allocating the 3 days PHP would have taken = my preferred approach. Anything but heroics makes for better business… but not as good a story.
Thanks folks!
February 27th, 2008 at 6:00 pm
Hey mate,
I’m a CS student at a university. This was a really cool read. It’s amazing to think all these mundane assignments will lead to this type of real world application. I guess I truly am a nerd. Cheers!
February 28th, 2008 at 3:58 pm
Hmm… seems like perl would have been the better choice languages for this task. After all, it is the “Practical Extrapolation and Reporting Language”.
February 29th, 2008 at 4:47 am
i read it all i am still in highschool so don’t expect a great coder now but in a few years i will , thanks for the reading the expression don’t try this on windows ruled and this only helped for myself to wish i could study coding more but i guess i can study that after school.
ps. i am from mexico so if it is a little hard to understand sorry.
February 29th, 2008 at 10:51 pm
I coding data manipulation…any idea where somebody like me [ie, a random guy] could find enough real data to try new techniques and languages to manipulate it?
March 1st, 2008 at 12:27 pm
Nice job! These types of emergencies make the boss guys remember why they have all those nerds stuffed away in the basement.
March 2nd, 2008 at 3:29 pm
I like the last line. But you did not describe the actual problem really well. or may be I did not truly understood that. Anyway, this was nice.
March 2nd, 2008 at 3:32 pm
now that’s interesting, interesting indeed
March 3rd, 2008 at 10:08 pm
This is an awesome story. I love these tales from the trenches…
March 12th, 2008 at 1:26 am
I work in db’s field, when i first started i’ve seen some stuff that made me cry, nice work on making it faster than grep but i think the real problem is how the heck the db designer got away with 40 GB of 220 uniq records ? he should be fired.
Ken : for random guys that need that ammount of data -> generate it.
Nice job again.
March 12th, 2008 at 2:33 pm
Yea, nothings faster than grep.. Dealing with a log file, having perl:
foreach () {
next if (!~ /\|$mip\|/);
print “$_\n”;
}
took 8 seconds on the data, while:
$grepstring = “/usr/bin/grep \”\|$mip\|\”";
system ($grepstring);
took 1.6 seconds.
March 13th, 2008 at 8:54 pm
Boy.. the real thing is to find out what problems you tackled before they through such a nice piece in your way…
March 14th, 2008 at 3:00 pm
[...] together that ran faster than grep? James Hicks has. And he lived to tell about it on his blog isnerd.net. James uses some interesting properties of the data he is matching to write a lighting fast [...]
March 20th, 2008 at 12:22 pm
if only you could master tsm….
March 20th, 2008 at 7:48 pm
TSM!
Give me Amanda any day. Could write a whole article on backup software… but would anyone read it?
March 20th, 2008 at 8:16 pm
I personally love C, but I definitely agree that it is very hard to hack something up quickly and get it to work flawlessly. Juggling pointers and memory allocation in C is somewhat of an art form, and it takes a lot of care and practice, and most of all _time_. Languages like Perl and PHP are definitely more suited for quick ditties, but have subsequent hits in performance, which are usually not noticed when working on data several magnitudes smaller than your example. Its really just a function of whether or not you can afford the extra processing time. If this is a general solution that can be used on many large files in the future, you may want to keep those C sources.
April 7th, 2008 at 1:30 pm
Good story, enjoyed reading.
I have had similar stressful situations but not quite like yours - I am not as proficient as C coder as yourself, probably would have searched the web for something similar to what I wanted and hacked it to work then hoped for the best
April 9th, 2008 at 5:56 am
How did you find the unique records in the first field? The naive approach can take o(n^2). One way is the have the fields added to a binary tree and then reduce it to an array. This probably would had increased the programming overhead though. Another approach is to sort the first field and pick the unique records this way. This would probably be the easiest approach for a 0(n lg n) approach.
April 9th, 2008 at 8:44 pm
Jesus: Given that we only needed to scan 13k records to find the unique members of the first field, any approach that involved a computer was likely to work
Geoffrey: I don’t claim to be a proficient C coder either
Iain: My C sources are now the property of my previous employer, who probably have not kept them.
Apolodor: There were not 220 unique fields in the 40GB of data… of the 13,000 we wanted, the first field had only 220 unique members.
April 10th, 2008 at 11:43 am
egrep ‘^(opt0|opt1|opt2|…|opt219)’ file
maybe I missed something, but why would this not work?
April 10th, 2008 at 9:50 pm
egrep: I’m guessing egrep wont accept a 180KB + overhead commandline. It was probably tried before they tried looping/nesting greps, which was before they handed me the problem.
However, if you could do that, I’m sure it would work. It might take many hours to chew through it all rather than a few hours… but it would work.
September 4th, 2008 at 2:41 pm
This sounds a lot like my job. We don’t usually get data sets *that* big, but I’m not gonna bet against it, either.