Major Improvements to Ruby Wireshark Wrapper

It’s been a while since I last reported on the status of my Wireshark wrapper for Ruby. This past Thanksgiving weekend I put a lot of time into it, and I’m pretty pleased with the progress.

I’ve made a few major changes to accommodate my long-term use for this wrapper, which is to index and analyze hundreds of gigabytes of captured network traffic.

First, I added the ability to dump a whole packet into YAML for storage as a blob. This was a compromise, in that I wanted to preserve the dissected structure of each packet, but obviously didn’t want to create a database schema to accommodate the dozens of fields one finds in a typical packet. I figured I’d save off each packet’s YAML representation in a BLOB, then retrieve and display the whole packet’s hierarchy in a GUI if needed. Any fields that would be involved in querying or reporting would obviously need to be hoisted into database fields, but that would be a small subset of each packet’s fields.

My initial YAML implementation used the Syck engine as exposed in the Ruby standard library’s YAML class. Unfortunately, this required that I query each field’s name, value, display name, and display value, which creates five Ruby wrapper objects per field. The whole reason I modified the field wrapper to defer creation of Ruby objects was to avoid the huge performance hit this incurs.

So, using the slow-but-working Syck-based implementation as a baseline, I wrote a pure C++ YAML serializer, tuned specifically for serializing field hierarchies, that uses a C++ stringstream to build the YAML string efficiently in memory. Based on my performance numbers, the mean serialization time now ranges from 0.016 seconds down to effectively 0.000 seconds (in other words, faster than the measurement resolution of the Benchmark class), compared to 0.5 seconds on average with the YAML class. To be sure, this is not a reflection on the YAML library’s serialization performance, but rather on the size of the performance gain I get from avoiding the creation of dozens or hundreds of Ruby objects per packet.
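To give a rough idea of what a stringstream-based serializer for a field hierarchy might look like, here is a minimal sketch. It is not the wrapper’s actual code: the `Field` struct, `yaml_quote`, `write_field`, and `packet_to_yaml` are all hypothetical names invented for illustration, and the YAML layout (a list of name/value maps with nested `children`) is just one plausible choice.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a dissected field; the real wrapper's types differ.
struct Field {
    std::string name;           // e.g. "ip.src"
    std::string display_value;  // human-readable value
    std::vector<Field> children;
};

// Emit a YAML double-quoted scalar, escaping quotes, backslashes, and
// ASCII control characters. Bytes >= 0x80 are passed through unchanged,
// so this assumes the text is already valid UTF-8.
std::string yaml_quote(const std::string& s) {
    static const char hex[] = "0123456789abcdef";
    std::ostringstream out;
    out << '"';
    for (char c : s) {
        const unsigned char u = static_cast<unsigned char>(c);
        switch (c) {
            case '"':  out << "\\\""; break;
            case '\\': out << "\\\\"; break;
            case '\n': out << "\\n";  break;
            case '\r': out << "\\r";  break;
            case '\t': out << "\\t";  break;
            default:
                if (u < 0x20) {
                    out << "\\x" << hex[u >> 4] << hex[u & 0x0f];
                } else {
                    out << c;
                }
        }
    }
    out << '"';
    return out.str();
}

// Write one field as a list item, then recurse into its children.
// Each nesting level indents the child list four spaces further.
void write_field(std::ostringstream& out, const Field& f, int depth) {
    const std::string indent(static_cast<std::string::size_type>(depth) * 2, ' ');
    out << indent << "- name: " << yaml_quote(f.name) << '\n';
    out << indent << "  value: " << yaml_quote(f.display_value) << '\n';
    if (!f.children.empty()) {
        out << indent << "  children:\n";
        for (const Field& child : f.children) {
            write_field(out, child, depth + 2);
        }
    }
}

// Serialize a packet's top-level fields into a single YAML string.
std::string packet_to_yaml(const std::vector<Field>& top_level) {
    std::ostringstream out;
    for (const Field& f : top_level) {
        write_field(out, f, 0);
    }
    return out.str();
}
```

The point of the sketch is that the whole document is built inside one stream in native code, and only the finished string would need to cross into Ruby as a single object.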

Once my C++ YAML serializer was producing YAML that parsed to a structure identical to the output of the Syck-based reference implementation, I started to worry about large binary field values. As an example, I captured the traffic caused by downloading a 50 KB JPEG over HTTP. This capture contained a bunch of TCP packets, which Wireshark reassembled so that the final TCP packet in the session included not just the data from the packet’s frame, but also the reassembled data consisting of the entire TCP payload for the HTTP response.

Obviously, serializing this out to YAML is somewhat inefficient. Instead, I reverse-engineered the Wireshark tvbuff_t stuff a bit more and figured out that each packet has a GSList of data_source objects, where each data_source has a name and a tvbuff_t. Normal packets have only one data_source, Frame, but the last TCP packet of a reassembled stream also contains a Reassembled TCP data_source, which holds the data from the entire reassembled payload. I now expose these data sources separately, and I modified each Field object to report which data source contains its value, the offset into that data source where the value is located, and the length in memory of the value. With that, I can feasibly store the BLOB or BLOBs that make up each packet in the database as binary objects, and still reliably reassemble the packet or extract raw field values at will.
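As a rough illustration of the idea, here is a small sketch of pulling a field’s raw bytes back out of stored data sources, given the source name, offset, and length. The `DataSource` and `FieldLocation` structs and the `extract_raw` function are hypothetical names made up for this example; they are not Wireshark’s or the wrapper’s actual types.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical model of one stored data source, e.g. "Frame" or
// "Reassembled TCP", holding the raw bytes saved as a BLOB.
struct DataSource {
    std::string name;
    std::vector<std::uint8_t> bytes;
};

// Hypothetical record of where a field's value lives.
struct FieldLocation {
    std::string source_name;  // which data source holds the value
    std::size_t offset;       // byte offset into that data source
    std::size_t length;       // length of the value in bytes
};

// Return a copy of the field's raw bytes, or throw if the data source is
// missing or the (offset, length) range falls outside it.
std::vector<std::uint8_t> extract_raw(const std::vector<DataSource>& sources,
                                      const FieldLocation& loc) {
    for (const DataSource& src : sources) {
        if (src.name != loc.source_name) {
            continue;
        }
        // Written this way to avoid overflow in offset + length.
        if (loc.offset > src.bytes.size() ||
            loc.length > src.bytes.size() - loc.offset) {
            throw std::out_of_range("field lies outside its data source");
        }
        const auto first = src.bytes.begin() + static_cast<std::ptrdiff_t>(loc.offset);
        return std::vector<std::uint8_t>(first, first + static_cast<std::ptrdiff_t>(loc.length));
    }
    throw std::invalid_argument("unknown data source: " + loc.source_name);
}
```

With something like this, only the per-field triple of source name, offset, and length needs to be kept alongside the packet’s structure, while the bytes themselves are stored once per data source.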

I think the next step is to build a basic data model for storing packets, start loading it up, and then implement some basic analysis, like correlating IP addresses with hostnames, detecting interesting traffic, etc.

As usual, Commissar Richard Stallman requires I make my code available under the Marxist GPLv2; the SVN repository has the details. Note that the GPL doesn’t say anything about helping others get this shit building; it took me days to figure out the build process for Wireshark and Ruby, so I bid you good luck and godspeed.