apocryph.org Notes to my future self

4 Oct 2008

An analysis of Ruby 1.8.x HTTP client performance

Not too long ago I bitched about the performance of Ruby’s HTTP client. Some of the comments to that post prompted me to investigate further, in the hopes of finding a more performant implementation.

The results of my analysis are in, and they’re…interesting, to say the least.

Summary

Ruby 1.8.6 (which still seems the dominant version among both Linux binary packages and the Windows One-Click Installer) uses a hard-coded 1K buffer size for HTTP reads, which leads to a ton of CPU usage during large HTTP downloads, even though the operation should be I/O bound and barely touch the CPU.

Ruby 1.8.7 includes a change described by the following entry in the changelog:

Mon Mar 19 11:39:29 2007  Minero Aoki  <aamine@loveruby.net>

    * lib/net/protocol.rb (rbuf_read): extend buffer size for speed.

After this change, Ruby’s HTTP implementation now uses a hard-coded 16K buffer, in the hopes of improving performance. Whether or not this actually improves things will become clear in my analysis later on.
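To get a feel for what the buffer size means in practice, here is a back-of-the-envelope sketch of how many socket reads a 10MB download needs at each buffer size. It assumes every read fills the buffer completely, which real sockets rarely do, so treat the numbers as a lower bound. The method name min_reads is made up for this illustration and is not part of Net::HTTP.

```ruby
# Lower bound on the number of socket reads needed to fetch a file,
# assuming (optimistically) that every read fills the whole buffer.
def min_reads(file_bytes, buffer_bytes)
  (file_bytes + buffer_bytes - 1) / buffer_bytes  # integer ceiling division
end

TEN_MB = 10 * 1024 * 1024

[1024, 16 * 1024, 64 * 1024].each do |buf|
  puts "#{buf / 1024}K buffer: at least #{min_reads(TEN_MB, buf)} reads"
end
```

Each of those reads carries fixed per-call overhead in the interpreter, so fewer, larger reads should mean less CPU for the same number of bytes.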

In addition to Ruby’s built-in Net::HTTP client, I evaluated two alternatives: a version of the rfuzz HTTP client modified to support streaming GETs, and curb, the Ruby bindings for the native libcurl HTTP client library. My goal was to determine the best-case Ruby HTTP client performance as indicated by the performance of these two implementations, then munge Ruby’s stock implementation to try to approach the best-case performance.

rubyhttp

I wrote a tool, rubyhttp, to help me perform these tests. The code is freely available at my SVN repository at http://svn.apocryph.org/svn/projects/rubyhttp/trunk. To grab the code, do a svn co http://svn.apocryph.org/svn/projects/rubyhttp/trunk. The tests below were run with revision 127 of the code.

Test environment

I ran the tests on two machines: wyoh, a Windows XP x64-edition Core 2 Duo laptop with a FiOS internet connection, and lio, one of my FutureHosting VPS boxes running CentOS 5.

On wyoh I used the version of Ruby that comes with the latest one-click installer:

>ruby -v
ruby 1.8.6 (2007-09-24 patchlevel 111) [i386-mswin32]

On lio I tested two versions of Ruby. The first was installed by the ruby yum package:

$ ruby -v
ruby 1.8.6 (2007-09-24 patchlevel 111) [i686-linux]

As you can see, this is the same version and patchlevel as my Windows box. Once I discovered the 16k buffer enhancement in Ruby 1.8.7, I downloaded and built the latest 1.8.7 source tree. This is:

$ ~/ruby18/bin/ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [i686-linux]

Test data

My test code fetches the 10MB test files published by FutureHosting for measurement of the network performance at each of their data centers. Thus, my code retrieves data from Seattle, Dallas, Chicago, Washington DC, and London. lio is located in the very same Dallas data center, hence the crazy-high download speeds there, while wyoh is located in the suburbs of Washington DC in close geographical and network proximity to the DC datacenter.

HTTP variations

Each test run does an HTTP GET from five different locations, using Net::HTTP and (on Linux only) rfuzz and curb as well. Neither rfuzz nor curb could be made to work on Windows, so the Windows runs use only Net::HTTP.

Most of the tests exercise some variation in the Net::HTTP implementation. The following variations are used:

  • stock – As it implies, the Net::HTTP implementation is unmodified from whatever ships with the version of Ruby being used
  • custom-16kbuffer – Modifies the buffer size from 1K to 16K. Note that Ruby 1.8.7 already includes this modification, so you’ll only see this run with Ruby 1.8.6 on Windows.
  • custom-16kbuffer-notimeout – Buffer size of 16K, and the timeout call is removed. This obviously isn’t a practical change, but it demonstrates the overhead of Ruby’s appalling timeout implementation
  • custom-16kbuffer-select – Buffer size of 16K, and the timeout call is replaced with non-blocking I/O using select, as proposed by Tanaka Akira on the ruby-talk list
  • custom-16kbuffer-selectwithsysread – Buffer size of 16K, the timeout call is replaced with non-blocking I/O using select, as proposed by Tanaka Akira on the ruby-talk list, and the read_nonblock call made once select reports data ready to read is replaced by sysread
  • custom-64kbuffer-notimeout – Buffer size of 64K, and the timeout call is removed. This obviously isn’t a practical change, but it demonstrates the overhead of Ruby’s appalling timeout implementation
  • custom-64kbuffer-select – Buffer size of 64K, and the timeout call is replaced with non-blocking I/O using select, as proposed by Tanaka

All of these custom variations modify lib/ruby-1.8/net/protocol.rb, which contains the socket I/O functionality used by Net::HTTP. The rbuf_fill method contains the actual socket read logic.
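For illustration, here is a minimal sketch of the select-then-sysread idea behind the select variants. This is not the actual patch to protocol.rb, nor the code posted to ruby-talk; the method name read_with_select is hypothetical, and a pipe stands in for the socket so the example runs anywhere.

```ruby
require 'timeout'

# Read up to max_bytes from io, waiting at most timeout_secs for data.
# IO.select does the waiting, so the read is never wrapped in a timeout block.
def read_with_select(io, max_bytes, timeout_secs)
  ready = IO.select([io], nil, nil, timeout_secs)
  raise Timeout::Error, "no data within #{timeout_secs}s" if ready.nil?
  # select reported the descriptor readable, so sysread should return
  # immediately (it raises EOFError if the peer has closed the connection).
  io.sysread(max_bytes)
end

# Demo: a pipe stands in for the HTTP socket.
reader, writer = IO.pipe
writer.write("hello")
puts read_with_select(reader, 16 * 1024, 1.0)

begin
  read_with_select(reader, 16 * 1024, 0.1)  # nothing left to read
rescue Timeout::Error => e
  puts "timed out: #{e.message}"
end
```

The point of the design is that the wait happens inside a single select call, rather than in whatever machinery the timeout method sets up around every read.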

Data

Each run outputs the following information for each combination of HTTP URL and HTTP client implementation:

  • Site – name of the site (eg ‘seattle’, ‘washdc’, etc)
  • Impl – name of the HTTP implementation
  • KBytes Transferred
  • KBytes/second
  • Chunk count – The number of reads required to fetch the entire file
  • Mean chunk size – The average read size
  • Min chunk size
  • Max chunk size
  • User Time – The % of user time taken by ruby during the run
  • System Time – The % of system time taken by ruby during the run
  • Total CPU Time – The total % of CPU time taken by ruby during the run
  • Clock time – The amount of time spent downloading the file
  • % CPU usage – Defined as Total CPU Time / Clock Time * 100. The percentage of available CPU time taken by ruby
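As a rough sketch of how numbers like these can be collected (an assumption for illustration; rubyhttp may gather them differently), Process.times gives the user and system CPU seconds for the current process, and Time.now gives the wall clock. The method names cpu_percent and measure are hypothetical.

```ruby
# % CPU usage as defined above: total CPU time over clock time, times 100.
def cpu_percent(total_cpu_secs, clock_secs)
  return 0.0 if clock_secs <= 0
  total_cpu_secs / clock_secs.to_f * 100
end

# Run the block and report user, system, total CPU, and wall-clock seconds.
def measure
  cpu_before = Process.times
  wall_before = Time.now
  yield
  clock = Time.now - wall_before
  cpu_after = Process.times
  user = cpu_after.utime - cpu_before.utime
  sys = cpu_after.stime - cpu_before.stime
  { :user => user, :system => sys, :total => user + sys, :clock => clock,
    :cpu_pct => cpu_percent(user + sys, clock) }
end

# Demo: a CPU-bound loop standing in for a download.
stats = measure { 200_000.times { |i| Math.sqrt(i) } }
printf("user=%.3fs system=%.3fs clock=%.3fs cpu=%.1f%%\n",
       stats[:user], stats[:system], stats[:clock], stats[:cpu_pct])
```

An I/O-bound download should show a low percentage here, since most of the clock time is spent waiting on the network rather than burning CPU.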

The raw data are available under SVN at results/linux/2008-10-4 and results/windows/2008-10-4. I used my combine_csv.rb tool to generate results/linux/2008-10-4/aggregate.csv from the individual test results. I ended up not using the Windows results as they complicated the graph and didn’t materially impact the conclusion.

I sucked aggregate.csv into Excel to do some munging.

Pretty Pictures

I uploaded the data to Swivel, thinking it would make it easy to analyze the data. It didn’t. I wanted to do a clustered bar graph, where each cluster corresponds to a site, and bars within that cluster reflect CPU usage for each implementation when downloading from that site. Swivel is way too limited for that.

The best I can do is this graph, which clusters by implementation and graphs CPU usage for each site; the opposite of what I wanted. You can play with the data yourself if you like.

Using good old-fashioned Excel, I generated this graph:

Ruby HTTP implementations performance

As you can see, the worst performers are the stock Net::HTTP implementations in both 1.8.6 and 1.8.7, though 1.8.6 is noticeably worse due to the 1K buffer size vs 16K for 1.8.7. The best performer is curb (libcurl bindings for Ruby), under both 1.8.6 and 1.8.7. The fastest Net::HTTP-based implementation uses a 16K buffer size and bypasses the timeout method, which is apparently quite inefficient. Using the non-blocking select to implement a timeout is slower than no timeout at all, but still considerably better than the stock impl. Finally, the 64K buffer size variants actually performed worse than the 16K variants.

It’s also quite obvious that Dallas transfers took up the most CPU, while London took the least. What you can’t see from this graph, but would see in the raw data, is that Dallas transfers were crazy-fast (since these tests were run on the same network as the Dallas test file), so there was less wall-clock time spent on the test, thus the transfer was less I/O bound than others. For the same reason, London, by far the slowest transfer, uses the least amount of CPU. This does not mean that transfers from fast download sites are inherently less efficient. If instead of %CPU time I used the total CPU time column, this disparity would vanish.

Conclusion

Ruby’s Net::HTTP implementation blows. It’s a bit better in 1.8.7 with the new 16K buffer size, but the timeout implementation has got to go. Even with timeout eliminated, Net::HTTP is trounced by the pure-Ruby rfuzz and the native/Ruby blend curb, suggesting that timeout notwithstanding, there are other inefficiencies in Net::HTTP. Looking at the protocol.rb code, I’m struck by how painfully inefficient the implementation is with buffers. rfuzz and curb minimize buffer copies and my rfuzz streaming HTTP extension reuses the same buffer for multiple calls, while Net::HTTP is happily appending and slicing away at arrays.
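To make the buffer complaint concrete, here is a small sketch contrasting the two styles. Neither method is taken from protocol.rb, rfuzz, or curb; the names are hypothetical, and a StringIO stands in for the socket.

```ruby
require 'stringio'

# Append-and-slice style: each chunk is a fresh string that gets appended to
# a receive buffer and then sliced back out, so every byte is copied twice
# and every read allocates new strings.
def drain_append_slice(io, chunk_size)
  rbuf = String.new
  total = 0
  while (chunk = io.read(chunk_size))
    rbuf << chunk
    total += rbuf.slice!(0, rbuf.size).size
  end
  total
end

# Buffer-reuse style: one string is allocated up front, and read() refills
# it in place on every call via its optional outbuf argument.
def drain_reuse(io, chunk_size)
  buf = String.new
  total = 0
  while io.read(chunk_size, buf)
    total += buf.size
  end
  total
end

data = 'x' * 100_000
puts drain_append_slice(StringIO.new(data), 16 * 1024)
puts drain_reuse(StringIO.new(data), 16 * 1024)
```

Both methods consume the same bytes; the difference is how much allocation and copying happens along the way, which is where the extra CPU would come from.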

I think architecturally Net::HTTP can be saved, but it needs rewritten buffered I/O and an alternative to timeout, preferably in the form of select.

I’m going to try to work on the necessary changes, and will post whatever I come up with.

29 Sep 2006

HUGE gotcha in TransmitPackets

At work I’ve been trying to get TransmitPackets to work in the hopes of improving our performance. In the process I ran across some behavior that can’t be right.

According to the SDK docs, the nSendSize parameter specifies how much data at a time TransmitPackets will send to Winsock. I left this at 0, which uses the OS default (according to Network Programming for Microsoft Windows, Second Edition, the OS default is 64k). However, when I did that, as soon as I passed in an array of TRANSMIT_PACKETS_ELEMENTs whose sum total bytes exceeded 64k, the overlapped operation would fail with ERROR_INVALID_ARGUMENT.

I soon found out that if I set nSendSize to a value greater than or equal to the total number of bytes I was sending with each TransmitPackets call, it would work. No, that doesn’t make sense, and no, I can’t find any independent confirmation. That said, it was a HUGE gotcha.

13 Jul 2006

Test harness for Win32 network and disk performance tests

My recent investigations into Win32 socket performance led me to a few performance measuring tools, like iperf and netperf. However, in my case I wanted some extra features:

  • Use of Win32 IO completion ports for disk and network IO
  • Use of Win32 TransmitFile/TransmitPackets high-performance socket routines
  • Benchmarking of disk read/write performance as a part of overall throughput

So, yesterday I threw together a quick-and-dirty test harness to exercise these features. The code isn’t written for maintainability or readability; the point was to get something out quick which I could use to explore the performance landscape.

The sources are in my svn repository, and I’ve attached a source and Win32 binary tarball based on a snapshot of the code today.

The code requires a client and a server at each end. It doesn’t use any particular wire protocol; just a stream of bytes followed by a connection close. In fact, you can reproduce its client functionality with a nc whateverhost 12345 < srcfile, and its server functionality with nc -l 12345 > destfile. This comes in handy if you want to test against a UNIX host on which AsyncIoTest won’t run.
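Since the wire format is just bytes followed by a close, either end is easy to stand in for from a scripting language too. Below is a minimal Ruby sketch of both ends, assuming nothing beyond that description; the method names are hypothetical, and the demo uses a loopback connection on an OS-assigned port rather than 12345 so it can run on a single machine.

```ruby
require 'socket'

# Server side: accept one connection, read until the peer closes, and return
# the number of bytes received (like `nc -l`, but discarding the data).
def receive_stream(server)
  conn = server.accept
  total = 0
  while (chunk = conn.read(64 * 1024))
    total += chunk.size
  end
  total
ensure
  conn.close if conn
end

# Client side: send the payload, then close to signal end of stream.
def send_stream(host, port, payload)
  sock = TCPSocket.new(host, port)
  sock.write(payload)
ensure
  sock.close if sock
end

# Demo over loopback; port 0 lets the OS pick a free port.
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]
receiver = Thread.new { receive_stream(server) }
send_stream('127.0.0.1', port, 'x' * 1_000_000)
puts "server received #{receiver.value} bytes"
server.close
```

Don't expect this to tell you anything about Win32 I/O performance; it's only useful as a stand-in peer when the real tool can't run on one side.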

The same binary, AsyncIoTest.exe, can run as a server or a client. In both server and client mode, you can opt to run only network, only disk, or network and disk (default) tests. In network mode, the client sends random bytes to the server, while the server reads bytes and ignores them. In disk mode, the client reads from a source file as fast as it can, then drops the resulting data, while the server writes to a target file as fast as it can. In combined mode, which is the default, the client reads from a source file and sends it over the network, while the server reads data from the client and writes it to a target file.

So, the basic commands are:

To run a server:

asynciotest -s

To run a client:

asynciotest -c serverhost -f sourcefile

Where serverhost is the hostname or IP address of the box running asynciotest -s, and sourcefile is the path and file name of the file you want to read from and send. The server writes to a hard-coded file name, which it creates in the working directory and truncates with each connection.

To put either a client or server in disk-only mode, add the -d switch. Note that, in -d mode, the client and server can be run independent of one another, with the client reading data and dropping it, and the server making up random data and writing it.

To use network-only mode, use the -n switch. Unlike -d, in -n mode the client and the server remain dependent upon one another; they just don’t do anything with files.

When the client is in -n mode, or the server in -d mode, you must also provide the -l length parameter, where length is the amount of data you want to send over the network (client) or write to the file (server). length can be followed by scale values k, m, g, K, M, G. k denotes a scale of 10^3, m 10^6, and g 10^9, while K, M, and G denote 2^10, 2^20, and 2^30, respectively.
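The scale rules are easy to get backwards, so here is a small sketch of the arithmetic. It is not AsyncIoTest's own parser; parse_length and SCALES are names invented for this example, and it assumes the length is a whole number followed by at most one suffix.

```ruby
# Multipliers for the scale suffixes: lowercase is decimal, uppercase binary.
SCALES = {
  'k' => 10**3, 'm' => 10**6, 'g' => 10**9,
  'K' => 2**10, 'M' => 2**20, 'G' => 2**30
}

# Convert a length argument such as "500M" into a byte count.
def parse_length(text)
  match = /\A(\d+)([kmgKMG]?)\z/.match(text)
  raise ArgumentError, "bad length: #{text.inspect}" unless match
  number = match[1].to_i
  suffix = match[2]
  suffix.empty? ? number : number * SCALES[suffix]
end

%w[500 500k 500K 500M 2g].each do |arg|
  puts "#{arg} => #{parse_length(arg)} bytes"
end
```

So -l 500M means 524,288,000 bytes, while -l 500m means an even 500,000,000.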

You can force the client to use TransmitFile by passing it the -t switch when it’s in network-and-disk mode. If you pass -t when the client is in -n mode, TransmitPackets will be used instead. Passing -t in server mode or with the -d switch is an error.

You can set the size of the chunks used for reading and writing with the -k chunksize parameter; like -l you can use scale values for readability.

The TCP send and receive buffers can be set with -b bufsize; again, scale values are recognized.

The number of outstanding async ops to maintain in the queue is set with -o opcount. The default value is 2. Depending on the speed of your I/O you may want to go higher, but keep it reasonable; 8 or 10 is probably the upper boundary.

If you must, you can override the port used with -p port.

The raison d’etre of this tool is to measure performance implications of various I/O subsystems with a few possible Win32 I/O API calls. Some good tests would be:

asynciotest -s -n and asynciotest -c whatever -n -l 500M

To measure raw network throughput. Try adding -k 256k and -b 64k to see if larger TCP buffers and chunk sizes impact performance. Then try adding -t to the client to compare the performance of WSASend with TransmitPackets.

Once you’ve a sense for the raw network throughput, add the filesystem into the equation. Start with the client only, removing -n -l 500M and replacing it with -f file where file is the path of a large, relatively unfragmented file. If you’re transferring over a LAN, expect the file reads to significantly reduce throughput compared to network-only mode. Also experiment with -t on the client side, which will use the high-performance TransmitFile API instead of repeated WSASend calls.

Also beware of multiple runs of the client test; the source file will be cached by the Win32 cache manager, so you can expect subsequent runs with the same file to perform a bit better as a result. To account for this disparity, reboot the client between runs (yes, I know, that sucks).

Next, remove the -f file and replace with -n -l 500M on the client, and remove -n on the server. This removes the client disk from the I/O equation, but adds the server disk corresponding to the working directory of the AsyncIoTest server process. Compare this with the results from the client I/O test, then add back in the client I/O as well and see what happens.

There are a number of permutations you might try, but these should expose the key corners of the performance space.
