Performance investigation of Java's select() in Windows

Java has had select()-based I/O since (I believe) 1.4, through java.nio.channels.Selector and the supporting API. While network I/O over non-blocking SocketChannels has been working fine in one of our [Solace Systems'] messaging software platforms for a long time, and at a more than acceptable throughput, I had never really attempted to precisely measure typical timings of Selector.select() and SocketChannel.write(). That is, until this week, when a coworker coding against Winsock2 wanted to compare his timings against what we got doing similar work in Java. What I found was quite surprising...

Test Design

To get an idea of timings, I quickly bashed out your basic "Hello World" of non-blocking-SocketChannel applications (which we'll call the client side): it simply streams data as fast as possible to a netcat (nc) instance in listen mode (the server side). In the Java program, we connect to a remote listening port that was created like so:

nc -kl $PORT > /dev/null

Then we register an interest set of OP_WRITE on that channel, log the time offset (System.nanoTime()), and select(). Whenever select() returns, we log the time and attempt to write an X-kilobyte buffer to the socket, then select again, and so on. We also log the time spent in each call to write() and the number of bytes it reports written. We repeat this test for several values of X, getting a sense of how much data is copied from our input buffer to the socket's send buffer on each call to write(), and how long select() takes to return, indicating space in the socket's send buffer.
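The loop described above looks roughly like the following sketch. To keep it self-contained here, an in-process drain thread stands in for the nc instance, and the buffer size and iteration count are illustrative; the real test connected to a remote host and logged ~200 samples per run:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class SelectTiming {
    // Run the select()/write() loop for `iterations` writes of `bufSize` bytes;
    // returns the total number of bytes write() reported as written.
    static long run(int bufSize, int iterations) throws IOException {
        // In-process stand-in for `nc -kl $PORT > /dev/null`: accept and drain.
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        Thread drain = new Thread(() -> {
            try (SocketChannel peer = server.accept()) {
                ByteBuffer sink = ByteBuffer.allocate(64 * 1024);
                while (peer.read(sink) >= 0) sink.clear();
            } catch (IOException ignored) { }
        });
        drain.setDaemon(true);
        drain.start();

        SocketChannel ch = SocketChannel.open(server.getLocalAddress());
        ch.configureBlocking(false);
        Selector sel = Selector.open();
        ch.register(sel, SelectionKey.OP_WRITE);

        ByteBuffer buf = ByteBuffer.allocate(bufSize);
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            long t0 = System.nanoTime();
            sel.select();                       // wait for room in the send buffer
            long selectNanos = System.nanoTime() - t0;
            sel.selectedKeys().clear();

            buf.clear();
            t0 = System.nanoTime();
            int written = ch.write(buf);        // how much actually got copied out?
            long writeNanos = System.nanoTime() - t0;
            total += written;
            System.out.printf("select=%d ns, write=%d ns, wrote=%d bytes%n",
                              selectNanos, writeNanos, written);
        }
        ch.close();
        sel.close();
        server.close();
        return total;
    }

    public static void main(String[] args) throws IOException {
        run(1024, 20);                          // X = 1 KB in this run
    }
}
```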


First, here's the average time (over ~200 writes) spent in select(), in microseconds:

Buffer size   Avg. select() time (us)
1K            396011
2K            654
5K            846
10K           1271
100K          9332

First surprising result: the distribution is pretty much what you'd expect for a 100 Mbit network, except for the 1K datapoint, which should just have made you spray coffee on your monitor. The 1K writes start off very fast for the first few samples (~500 us), then hit a wall and only get woken up every 500,000 us, yielding a very, very slow transfer rate (2 KB/s). I initially thought this was due to Nagle's algorithm preventing a small buffer from being sent before a timeout expired, but setting TCP_NODELAY on the socket had no effect on this behaviour. I can confirm from a packet dump that the server end immediately acks every packet we send it, so it's not a question of the local TCP send window getting full.
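For the record, ruling out Nagle took a single call on the channel's socket. A minimal sketch (the channel here is left unconnected for brevity, whereas the real test set the flag on the channel connected to nc):

```java
import java.io.IOException;
import java.nio.channels.SocketChannel;

public class NodelaySetup {
    // Returns the TCP_NODELAY flag after setting it, as a sanity check.
    static boolean check() throws IOException {
        SocketChannel ch = SocketChannel.open();
        // Disable Nagle's algorithm; in the test this made no difference
        // to the 500 ms stalls.
        ch.socket().setTcpNoDelay(true);
        boolean nodelay = ch.socket().getTcpNoDelay();
        ch.close();
        return nodelay;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("TCP_NODELAY=" + check());
    }
}
```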

The second weird result is that on Windows, whenever you call SocketChannel.write(ByteBuffer), THE ENTIRE BUFFER GETS COPIED OFF AND REPORTED AS WRITTEN. You'd expect it to write only as many bytes as it can until it fills the local TCP send buffer (sized at SO_SNDBUF, which defaults to 8 KB, as we all know), then return that number, leaving the rest of your input buffer to be copied out on the next call to write(). In fact, that's my understanding of the Sun documentation (emphasis mine):

Writes a sequence of bytes to this channel from the given buffer.

An attempt is made to write up to r bytes to the channel, where r is the number of bytes remaining in the buffer, that is, src.remaining(), at the moment this method is invoked.

Suppose that a byte sequence of length n is written, where 0 <= n <= r. This byte sequence will be transferred from the buffer starting at index p, where p is the buffer's position at the moment this method is invoked; the index of the last byte written will be p + n - 1. Upon return the buffer's position will be equal to p + n; its limit will not have changed.

Unless otherwise specified, a write operation will return only after writing all of the r requested bytes. Some types of channels, depending upon their state, may write only some of the bytes or possibly none at all. *A socket channel in non-blocking mode, for example, cannot write any more bytes than are free in the socket's output buffer.*
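You can reproduce the observation with a sketch like the one below: a single non-blocking write() of a buffer far larger than SO_SNDBUF, aimed at a peer that never reads. The loopback setup and the 16 MB size are stand-ins for the original test (which used up to 100 MB); per the results above, Windows reports the whole buffer written, while UNIX typically reports less:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class BigWrite {
    // One non-blocking write of a buffer much larger than SO_SNDBUF, against
    // a peer that never reads; returns what write() claims it wrote.
    static int attempt() throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        SocketChannel ch = SocketChannel.open(server.getLocalAddress());
        SocketChannel peer = server.accept();   // connected, but never read from
        ch.configureBlocking(false);

        int sndbuf = ch.socket().getSendBufferSize();
        ByteBuffer big = ByteBuffer.allocate(16 * 1024 * 1024);
        int n = ch.write(big);
        System.out.println("SO_SNDBUF=" + sndbuf + ", wrote " + n
                           + " of " + big.capacity() + " bytes in one call");
        peer.close();
        ch.close();
        server.close();
        return n;
    }

    public static void main(String[] args) throws IOException {
        attempt();
    }
}
```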

Open Questions (Mystery!)

So, I'm left with two big questions:

1. What's going on with the 1K writes? I tried TCP_NODELAY on that socket (Nagle's algorithm being the obvious culprit when small writes have huge latency), with no change: select() only wakes up once per 500 ms. And it happens consistently on every single select. Since the local SO_SNDBUF is 8K, even if something fishy were going on around that 500 ms pause in select(), shouldn't you only get blocked for the full 500 ms once every 8 writes? I've never seen this happen in a real-world production app, though, so I'm willing to chalk it up to a quirk in my simplistic test code.

2. Isn't it a bit strange that write() returns immediately and always reports writing the full buffer under Windows, even if you pass in a 100 MB ByteBuffer to be output with an SO_SNDBUF of only 8 KB? On the two UNIX systems I tried it on, write() still reported much, much more than the value of SO_SNDBUF, but the results were all over the place: they didn't always match the size of the input buffer (as I'd expect).
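Whatever the platform chooses to report, the portable way to cope is to keep calling write() until the buffer is drained, falling back to select() only when a write makes no progress. A sketch (the in-process drain thread again stands in for nc, and the 1 MB payload is arbitrary):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class WriteFully {
    // Drain `buf` completely on a non-blocking channel, selecting for
    // writability whenever a write makes no progress.
    static long writeFully(SocketChannel ch, ByteBuffer buf) throws IOException {
        long total = 0;
        try (Selector sel = Selector.open()) {
            SelectionKey key = ch.register(sel, SelectionKey.OP_WRITE);
            while (buf.hasRemaining()) {
                int n = ch.write(buf);
                total += n;
                if (n == 0) {                   // send buffer full: wait for room
                    sel.select();
                    sel.selectedKeys().clear();
                }
            }
            key.cancel();
        }
        return total;
    }

    static long demo(int totalBytes) throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        // Drain thread standing in for `nc > /dev/null`.
        Thread drain = new Thread(() -> {
            try (SocketChannel peer = server.accept()) {
                ByteBuffer sink = ByteBuffer.allocate(64 * 1024);
                while (peer.read(sink) >= 0) sink.clear();
            } catch (IOException ignored) { }
        });
        drain.setDaemon(true);
        drain.start();

        SocketChannel ch = SocketChannel.open(server.getLocalAddress());
        ch.configureBlocking(false);
        long sent = writeFully(ch, ByteBuffer.allocate(totalBytes));
        ch.close();
        server.close();
        return sent;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("sent " + demo(1 << 20) + " bytes");
    }
}
```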
