Search

Faster Than Grep

February 26th, 2008 by James Hicks

You know the guano is hitting the fan when your boss’s boss comes up to your desk and says: “What’s faster than grep?

It was looking like a boring day. I got in late, almost 11 (B-A-D, I usually work 10:30-18:30), and by 11:30 had nailed all the daily maintenance stuff and was looking at a series of deadlocked and waiting-for-other-guys jobs. And a few epic jobs that couldn’t really be furthered today. Then the ops manager walked up to me…

– Today’s post is another SysAdmin ‘war story’, this time from 2006 --

“What’s faster than grep?”

If you aren’t a geek, grep is a really versatile and REALLY GOD-DAMN FAST thing you search text files with. If somebody important wants something faster than grep, they are either in an unholy hurry, or have a truckload of data.

Or worst case, like this case, both.

Fourty-four gig of data all told, a 180k list of some 13,000 records to pull out. Variable-length text data, a worst-case scenario. And we need the information extracted by tomorrow morning. Oh and you need to match two different fields.

Great.

The good news? The first field has a lot of duplicates. It’s 13,000 records, but only 220 unique in the first field. I flag that as “interesting”.

The problem with a looped grep is, you’re spinning the Big Wheel fast and the little wheel slow. You’re cycling through 44gig of data 13,000 times. Nothing’s going to be in RAM (this is 2006, and I still don’t have a box with 64GB of RAM dammit.. I stuck one in at my last job with 32 though). So I come at it from the other direction. A repeated grep command was literally taking days (it was tried before they handed me the problem)

I’ve written something faster than grep before. But I cheated then, and indexed the data using something called CDB, which is the fastest database system I’ve ever seen (look it up, it is truly annihilative). But today I had no time to index anything….. one pass was going to have to be enough. How’d I do it?

I wrote it in C. No kidding, that volume of data, I need the fastest _performing_ tool I can find, or it’s gonna take weeks. Compile with -O3! PHP would take way less time to code, saving an hour or two even, but take days longer to run.

First thing I did was load the 13,000 sets of two fields into an array. This meant that the data doing most of the work was stuck in RAM, and we only had to cycle through the 44GB of junk once. This is a good thing.

Lots of bugs ensued. The usual C stuff… you hack something together and it segfaults the first 100 times you run it, while you madly run around putting print statements everywhere trying to figure out where the fire is. Meanwhile, the clock is ticking and my boss’s boss is sitting next to me or pacing around his office looking worried. Tick tock.

I get the thing working, and move it over the HPUX box that has all the data on it. I didn’t even have vim on that box, so I coded it first up on my linux desktop. I start it up there on a small sample of the data that had been working on my desktop.

It crashes. And crashes. And crashes. I run into bug after goddamn bug in the awful HP libraries. sscanf doesn’t work anything like it does under gnu libc. I’m about ready to move the data off onto a heavy linux server when my Boss (not my Boss’s Boss) mentions that he’s installed the GNU compiler and libraries on the hpux machine. Sweet! Compiles first go, runs first go now.

Except it’s too slow.

“What cpu’s in this thing man?”

“Uhhh I think it’s 4x 360Mhz”

“$%#%!”

I start transferring the data onto the aforementioned heavy linux server. It has about 10 times the processor power. I have a brief discussion with IP Engineering about LETTING ME THROUGH THE FIREWALL NOW PLZ. Then it starts….. 5MB a sec anyone? This is the kind of thing, I mention, it would be USEFUL to UPGRADE TO GIGABIT ETHERNET for. Tick tock.

Hours later…. The hours are good in a way, they give me time to do a bit of other work and think about how to make my hacked-together program go faster.

The first thing is, I realise, smacking myself in the head, that I’m matching stuff and then continuing to compare the rest of the 13,000 wanted records to the line after it’s been matched! I fix this, and then I remember the “interesting” thing from earlier.

220 unique records in the first field only. I write a loop (god I hate C, this would be another one-liner in php :() to find these and put them into an array. I use this array to “screen” each record before scanning it. If the first field isn’t in the 220, the line doesn’t get compared to the 13,000 - we skip it and move on.

This actually slows down my sample data noticeably. But I know that my sample data has an unusually high number of matches - I figure my program is just doing an extra 220 comparisons on each line for that data, whereas the bulk of the REAL data will be skipped over quickly by this code. How do I test that theory? I’m able to run the program on the incomplete file as it downloads. Don’t try this on Windows :) I’m getting what looks like good results. It’s fast, and it’s Getting the Data.

I’m running out of hours. The first half of the data (glad it’s in two separate files now!) is finished downloading and I get to work on it for real. Looks good… it’s matching stuff, and the counter I’ve got in the program that prints a line when it gets to a million is clipping past fairly quickly. The dead spots in the data without any records in the 220 list fly past, and I estimate it’s going to take just 90 minutes to get through the first file! 285 million records in 90 minutes is just fine by me.
WE ARE GOING TO WIN.

The rest is easy. A few hours later, the second file finishes downloading. I start it up and walk home. It’s done by the time I get there and VPN in. Sweet. Home by 9pm, and the work done!

Today reminded me why I love my job.

Digg!   del.icIo.us

Want this article on your site?
James E Hicks, EzineArticles.com Basic Author

Posted in Systems Administration |

30 Responses

  1. Matt Says:

    what game is the picture in your banner taken from?

  2. Ste Says:

    Really nice read man. I’ve read a few stories similar to this through SU and I think they’re great.

  3. Manav Gupta Says:

    Brilliant! I loved every minute of reading it. Would be good if you could post snippets from your code too…

    …bit of shame that the first muppet who posted wants to know the picture in your banner..

  4. mngrif Says:

    I love it when I pull off hacks like that. Gives me a bit of a God-complex :P

    I hope your boss’ boss actually thanked you! It’s rare for us IT guys to get the thanks deserved…

  5. Alec Says:

    Absolutely epic. Kind of like raising a barn in one day, all by yourself.

    Do you think you could have gotten close performance in lisp? I bet coding it would have been a lot easier.

  6. sikanrong Says:

    dude, i feel like such a dork for reading sysadmin blogs like this; but it was a good hacker story. kudos to you! fuck C! the real trick is to allocate the 3 days that php is gonna take to do this :)

  7. Garth Bluntaxe Says:

    Vanguard:SOH i think

  8. -Jon Says:

    I’m just getting started into the wold of code having a design and photography background. I thought this was a great read. Thanks for sharing.

    -J

  9. James Hicks Says:

    Thanks for the great feedback guys!

    The picture in the banner is from Vanguard, Saga of Heroes as Garth suggested.

    For the record yes indeed my Boss and his Boss were all very pleased with how things went - as were the higher-ups who needed the data!

    Lisp performance: No idea :)

    Feeling like a dork for reading sysadmin blogs? Surely not! Nerd yes, dork no ;)

    Allocating the 3 days PHP would have taken = my preferred approach. Anything but heroics makes for better business… but not as good a story.

    Thanks folks!

  10. Nelson Says:

    Hey mate,
    I’m a CS student at a university. This was a really cool read. It’s amazing to think all these mundane assignments will lead to this type of real world application. I guess I truly am a nerd. Cheers!

  11. The d00d Says:

    Hmm… seems like perl would have been the better choice languages for this task. After all, it is the “Practical Extrapolation and Reporting Language”.

  12. HiZaM Says:

    i read it all i am still in highschool so don’t expect a great coder now but in a few years i will , thanks for the reading the expression don’t try this on windows ruled and this only helped for myself to wish i could study coding more but i guess i can study that after school.

    ps. i am from mexico so if it is a little hard to understand sorry.

  13. Ken Says:

    I coding data manipulation…any idea where somebody like me [ie, a random guy] could find enough real data to try new techniques and languages to manipulate it?

  14. Caffeine Addict Says:

    Nice job! These types of emergencies make the boss guys remember why they have all those nerds stuffed away in the basement.

  15. WordPress Guru Says:

    I like the last line. But you did not describe the actual problem really well. or may be I did not truly understood that. Anyway, this was nice.

  16. sir jorge Says:

    now that’s interesting, interesting indeed

  17. Some Dude Says:

    This is an awesome story. I love these tales from the trenches…

  18. apolodor.ln Says:

    I work in db’s field, when i first started i’ve seen some stuff that made me cry, nice work on making it faster than grep but i think the real problem is how the heck the db designer got away with 40 GB of 220 uniq records ? he should be fired.

    Ken : for random guys that need that ammount of data -> generate it.

    Nice job again.

  19. pete Says:

    Yea, nothings faster than grep.. Dealing with a log file, having perl:

    foreach () {
    next if (!~ /\|$mip\|/);
    print “$_\n”;
    }

    took 8 seconds on the data, while:

    $grepstring = “/usr/bin/grep \”\|$mip\|\”";
    system ($grepstring);

    took 1.6 seconds.

  20. Some_Nerd Says:

    Boy.. the real thing is to find out what problems you tackled before they through such a nice piece in your way…

  21. Sweet Hacks - Vol I | GrokCode Says:

    [...] together that ran faster than grep? James Hicks has. And he lived to tell about it on his blog isnerd.net. James uses some interesting properties of the data he is matching to write a lighting fast [...]

  22. davidhp Says:

    if only you could master tsm….

  23. James Hicks Says:

    TSM!

    Give me Amanda any day. Could write a whole article on backup software… but would anyone read it?

  24. lain_proliant Says:

    I personally love C, but I definitely agree that it is very hard to hack something up quickly and get it to work flawlessly. Juggling pointers and memory allocation in C is somewhat of an art form, and it takes a lot of care and practice, and most of all _time_. Languages like Perl and PHP are definitely more suited for quick ditties, but have subsequent hits in performance, which are usually not noticed when working on data several magnitudes smaller than your example. Its really just a function of whether or not you can afford the extra processing time. If this is a general solution that can be used on many large files in the future, you may want to keep those C sources.

  25. Geoffrey Says:

    Good story, enjoyed reading.

    I have had similar stressful situations but not quite like yours - I am not as proficient as C coder as yourself, probably would have searched the web for something similar to what I wanted and hacked it to work then hoped for the best :)

  26. Jesus DeLaTorre Says:

    How did you find the unique records in the first field? The naive approach can take o(n^2). One way is the have the fields added to a binary tree and then reduce it to an array. This probably would had increased the programming overhead though. Another approach is to sort the first field and pick the unique records this way. This would probably be the easiest approach for a 0(n lg n) approach.

  27. James Hicks Says:

    Jesus: Given that we only needed to scan 13k records to find the unique members of the first field, any approach that involved a computer was likely to work :)

    Geoffrey: I don’t claim to be a proficient C coder either ;)

    Iain: My C sources are now the property of my previous employer, who probably have not kept them.

    Apolodor: There were not 220 unique fields in the 40GB of data… of the 13,000 we wanted, the first field had only 220 unique members.

  28. egrep Says:

    egrep ‘^(opt0|opt1|opt2|…|opt219)’ file

    maybe I missed something, but why would this not work?

  29. James Hicks Says:

    egrep: I’m guessing egrep wont accept a 180KB + overhead commandline. It was probably tried before they tried looping/nesting greps, which was before they handed me the problem.

    However, if you could do that, I’m sure it would work. It might take many hours to chew through it all rather than a few hours… but it would work.

  30. S Says:

    This sounds a lot like my job. We don’t usually get data sets *that* big, but I’m not gonna bet against it, either.

Leave a Comment

Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.