Building Ruby on Windows, and performance

Last time, I encountered horrifying performance with my Ruby extension, and had two action items:

  • Build Ruby from sources so I’d have debug information
  • Profile my extension using Intel VTune

I was actually shocked at how easy it was to build Ruby from sources. Under Windows it’s literally just:

 win32\configure
 nmake
 nmake test
 nmake DESTDIR=foo install

Seriously. I did have to change win32\Makefile.sub to add /fixed:no to the linker command, since VTune won’t work with modules that are not relocatable, but other than that it was a no-brainer. All this makes me wonder why the official Windows builds of Ruby aren’t built with VC2k5 when it’s so superior. In fact, my test ran on the VC2k5 version of Ruby nearly twenty seconds faster, 75 seconds instead of 94!

Anyway, with that done I adjusted my VC2k5 extension project to copy the DLLs to the new ruby path, and got underway.

Let me now digress for a moment and point out just how horrifyingly bad Intel VTune is. I use VTune at work for performance-tuning our server apps, and whenever I encounter a performance problem I exhaust all other alternatives before I bring VTune to bear; it’s that bad.

First off, one gets the feeling that, despite being in version 9.x, VTune is written and maintained by interns. Its GUI is clunky, its installer is temperamental, it crashes for no discernible reason, it won’t run at all without admin privs (seriously, not at all; won’t even start), and the support board is full of questions in broken English and answers to the effect of ‘is it plugged in? did you turn it on? try calling the support line’.

VTune is unique among profilers in that it has two ways of profiling. What it calls ‘call graph profiling’ is the typical profiler functionality, which instruments all your code, makes it run 100 times slower, then when it’s done running, shows you the complete call graph with time spent in each function, helping you see where the slow spots in your app are.

VTune’s other profiling solution, and that which sets it apart, is based on taking snapshots of the processor state based on triggers like n instructions retired. In each snapshot, VTune notes where execution is at that instant. This snapshot approach doesn’t require instrumenting code, and it doesn’t slow it down that much, but there is one HUGE downside: no call graph. It can tell you your app spent all its time in malloc, but it can’t tell you who called the mallocs that it spent all its time in.
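To make that downside concrete, here is a toy sketch in C++. It has nothing to do with how VTune is actually implemented; the `Stack` type and both function names are made up for illustration. It assumes each snapshot could see the whole call stack, then compares what you learn when only the executing function is kept (the flat, sampling-style view) with what you learn when the caller is kept too.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sample type: the call stack at the instant of a snapshot,
// outermost frame first, executing function last.
using Stack = std::vector<std::string>;

// Flat view: count samples only by the function that was executing.
// This is all a snapshot-style profile has to offer.
std::map<std::string, int> flat_profile(const std::vector<Stack>& samples) {
    std::map<std::string, int> counts;
    for (const Stack& s : samples) {
        if (!s.empty()) {
            ++counts[s.back()];
        }
    }
    return counts;
}

// Caller-aware view: count samples by (caller, executing function) pair,
// which is the extra information a call graph provides.
std::map<std::pair<std::string, std::string>, int>
caller_profile(const std::vector<Stack>& samples) {
    std::map<std::pair<std::string, std::string>, int> counts;
    for (const Stack& s : samples) {
        if (s.size() >= 2) {
            const std::size_t last = s.size() - 1;
            ++counts[std::make_pair(s[last - 1], s[last])];
        }
    }
    return counts;
}
```

Feed both functions the same samples and the flat view can only say "malloc was hot", while the caller-aware view says which function was doing the allocating.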

Not surprisingly, Intel extols this snapshot-based profiling as though it’s actually usable, but I’ve never run into a situation in which I didn’t end up using the call graph profiling to get the info I want. This would just be an annoyance, except call graph profiling crashes almost every time I try to use it.

Back to my current problem: I was using call graph profiling, and sure enough, the app I was profiling crashed on startup. Intel says this happens if you’re using modules that aren’t relocatable, but it’s also supposed to tell you which modules aren’t relocatable. At first it didn’t; then I fiddled around with some settings and suddenly it complained that about half the DLLs I was profiling weren’t relocatable. I removed them from the list of DLLs to keep track of, and I was off.

I ran my performance test that processes a capture file with 36k packets in it, and got the results. They were surprising to say the least.

According to VTune, my whole test ran for 22 million msc, whatever that is. It’s not milliseconds, since the app took way less than 6 hours to run, and it’s not microseconds, since it took more than 22 seconds to run. Whatever. Of those 22M, 11M were spent in either malloc or free. I’m not calling either of those directly, and in fact the biggest offender in terms of calling malloc and free really calls but one method: rb_class_new_instance.
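The unit guesses are easy to check with a few lines of C++. This is only the arithmetic from the paragraph above; the constant and function names are invented for the sketch, and the real unit remains unknown.

```cpp
// Figures quoted from the profiler output; the unit is unknown.
constexpr double kReportedTotal = 22e6;       // whole test run
constexpr double kReportedMallocFree = 11e6;  // time in malloc or free

// Convert the raw figure to seconds under each guessed unit.
constexpr double seconds_if_milliseconds(double ticks) { return ticks / 1e3; }
constexpr double seconds_if_microseconds(double ticks) { return ticks / 1e6; }

constexpr double to_hours(double seconds) { return seconds / 3600.0; }

// Fraction of the run attributed to one part of the program.
constexpr double share(double part, double whole) { return part / whole; }
```

As milliseconds, 22 million comes to 22,000 seconds, a bit over six hours; as microseconds, it comes to 22 seconds. Either way, malloc and free account for half the total.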

My takeaway from this is that object creation in Ruby is expensive enough that creating three million objects (my rough count; one for each packet, and one for each field within each packet) is slow. This rather confirms my suspicions that I should create one Ruby object for the packet, and wrap it around a C++ associative container to store the fields. Since object creation in C++ is pretty fast (and lightweight), this should improve performance quite a bit.
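A minimal sketch of what that C++ side might look like, assuming field names and values can both be treated as strings. The `Packet` class and its methods are hypothetical, not the extension’s actual code, and the Ruby binding itself (for example, wrapping the object with the C API’s `Data_Wrap_Struct`) is left out so the snippet stands alone.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical packet type: every field lives inside one C++ container,
// so only the packet itself would need a Ruby object wrapped around it.
class Packet {
public:
    // Store or overwrite a field by name.
    void set_field(const std::string& name, std::string value) {
        fields_[name] = std::move(value);
    }

    // Look up a field; returns nullptr when the packet has no such field.
    const std::string* field(const std::string& name) const {
        auto it = fields_.find(name);
        return it == fields_.end() ? nullptr : &it->second;
    }

    std::size_t field_count() const { return fields_.size(); }

private:
    std::map<std::string, std::string> fields_;  // field name -> raw value
};
```

With something along these lines, a field would presumably be converted to a Ruby value only when a script actually asks for it, rather than eagerly for every field of every packet.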