Web Server Performance Testing

I started to run some performance tests on heliod to compare it with a handful of other web servers out there. I’ll publish results in the next article once the runs complete.

Now, heliod shines brightest on large systems with multiple many-core CPUs. Unfortunately I don’t own such hardware, so I’m testing on a very low-end system I have available, but the results should still be interesting.

One of the many challenges of benchmarking is deciding what to test. All performance tests are to some extent artificial; ultimately all that matters is how the server performs in production with actual customer traffic over the production network.

For these runs I chose to measure static file performance with files of size 8K.

One of my pet peeves is articles which show someone’s performance run results without any details as to how they were measured. To avoid doing that myself, below are all the details on how I’m running these tests. In addition, I will publish the driver source and data set so you can run the same tests on your own machine if you like.

Client

A trustworthy load generator client is vital for performance testing, something too often overlooked (I can’t believe anyone is still using ‘ab’, for instance!). If the client can’t scale or otherwise introduces limitations, the resulting numbers will be meaningless because they reflect the limits of the client, not those of the server being tested.

I’m using faban for these tests.

I’m running faban on my Open Indiana machine which has an AMD Phenom II X4 (quad core) 925 CPU and 8GB RAM.

Server

I’m running the various web servers under test on a small Linux box. It would be fun to test on a server with lots of cores but this is what I have available:

  • CPU: Intel Atom D510 (1.66GHz, 2 cores, 4 hardware threads)
  • RAM: 4GB
  • OS: Debian 6.0.6 32bit

Network

Both machines have gigabit ethernet and are connected to the same gigabit switch.

As an aside, there seems to be a trend of testing servers via localhost (that is, with the load generator on the same machine connecting to http://localhost). Doing so is a mistake that will report meaningless numbers. Your customers won’t be connecting to the server on localhost, so your load generator shouldn’t either.

Software Configuration

Another important decision for benchmarking is how to tune the software stack. For this round of runs I am using out-of-the-box default configurations for everything. This means the results will not be optimized. I’m sure most, if not all, of the web servers being tested could score higher if their configuration were tuned to the very specific requirements of this particular test and hardware environment. Why test default configurations? A few reasons:

  • Baseline: I expect I’ll run more optimized configurations later on, so it is nice to have a baseline to compare how much future tuning helps.
  • Reality: Truth is, lots of servers do get deployed into production with default configurations. So while the results are not optimal, they are relevant to everyone who has not taken the time to tune their server configuration.
  • Fairness: I am familiar with performance tuning for only a couple of the web servers I’m testing here. If I tune those well and the others badly, I’ll skew the results in favor of the ones I know. So to be fair, I won’t tune any of them.

Software Versions

Whenever possible, I installed the Debian package for each server.

  • apache2-mpm-event 2.2.16-6+squeeze8
  • apache2-mpm-worker 2.2.16-6+squeeze8
  • lighttpd 1.4.28-2+squeeze1
  • nginx 0.7.67-3+squeeze2
  • cherokee 1.0.8-5+squeeze1
  • monkey 0.9.3-1
  • varnish 2.1.3-8
  • g-wan from http://gwan.com/archives/gwan_linux32-bit.tar.bz2
  • heliod 0.2 from http://sourceforge.net/projects/heliod/files/release-0.2/

Operation

The server has 1000 files, each file is different and each one is 8K in size. The faban driver requests a random file out of the 1000 each time.

There is no client think time between requests. That is, each client thread will load up the server as fast as the server can respond.
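
To make the workload concrete, here is the essence of what each client thread does, as a minimal standalone Java sketch rather than the actual faban driver I will publish (the base URL and file naming scheme here are made up for illustration; faban itself takes care of threading, timing and measurement):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Random;

// Simplified sketch of one client thread: request a random file out of the
// 1000 8K files, read the response, then immediately issue the next request
// (zero think time). Names and URLs are illustrative only.
public class StaticFileClientSketch {

    public static void main(String[] args) throws Exception {
        String base = args.length > 0 ? args[0] : "http://sut.example:8080";
        Random rng = new Random();
        byte[] buf = new byte[8192];

        while (true) {
            int n = rng.nextInt(1000); // pick one of the 1000 files at random
            URL url = new URL(base + "/file" + n + ".html");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (InputStream in = conn.getInputStream()) {
                while (in.read(buf) != -1) {
                    // drain the 8K response body and discard it
                }
            }
            conn.disconnect();
            // no sleep here: the next request goes out as soon as this one completes
        }
    }
}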

For each server I am doing ten runs, starting with 1 concurrent client (sequential requests) and going up to 10 concurrent clients. Keep in mind the server has a single CPU with two cores and four hardware threads, so one should expect throughput to scale up from 1 to 4 concurrent clients and start to taper off after that.

Each run is 30 minutes long. This allows a bit of time to see if throughput remains consistent for a given load level. Each run is preceded by a one minute warmup time.

heliod and CRIME

CRIME is an interesting attack that leaks information otherwise protected by SSL/TLS. It is an easy to understand (and explain) example of why security issues are nearly always more complex than they seem!

heliod uses NSS, which has TLS compression disabled by default, so heliod is not vulnerable.

Web Server

If you were to look at the HTTP response headers from this site, you’d see it is being handled by:

Server: heliod-web-server/0.1

That is a web server you’ve probably never heard of before… or rather, you most likely have, but under various different names.

Way back when, this was the Netscape Enterprise Server, which later became iPlanet Web Server (during the Sun|Netscape alliance). Under Sun alone it was renamed several more times, to SunONE Web Server and then Sun Java System Web Server (and maybe some other name variants I forget now). Naming nonsense aside, it’s been the same evolving code base all along, best known for high performance and even higher scalability.

Thankfully, Sun open sourced the code in 2009 under the BSD license. Most of it, anyway: a few parts were left out, mainly the administration support, the installer and the embedded Java servlet engine. The open source code was kept in sync with the commercial releases until January 2010 (7.0 update 8, using the commercial release version numbering). After that the open source repository has not seen any activity, which is not surprising given that January 2010 was also when Oracle acquired Sun.

Surprisingly, the source repository is still available:

hg clone ssh://anon@hg.opensolaris.org/hg/webstack/webserver

The source as published can be tricky to build and it does not produce an installable package. When I was setting up this site last year I ended up forking the code into http://sourceforge.net/projects/heliod/. The code is the same, but I added a rudimentary install script to make it easier to get going. You can also download binaries for Solaris (x86) and Linux from the sourceforge page if you prefer not to build it yourself.

(Update: The source is now in github here: https://github.com/jvirkki/heliod)

 

Joyent Debacle

I’ve been hosting this server on Joyent for a while now, for a few reasons. One was that their VMs are Solaris zones: (a) that is cool in itself, (b) I prefer hosting Internet-facing servers on Solaris, and (c) some of the Sun talent moved to Joyent after the Oracle disaster, so I liked the idea of supporting Joyent. The other reason was that Joyent offered a fixed-price-for-life server when I signed up, so it was a nice deal as well.

Yesterday Joyent broke their promises by dropping all the fixed and lifetime plans out of the blue. There’s been coverage on Slashdot, Network World, ZDNet and plenty of other places. Discussion rages on at the support forum, and there is a Google group dedicated to finding alternatives as well.

The people who prepaid for a lifetime plan were the hardest hit. For me it is not that bad since I was on a monthly plan (not prepaid), but they did still break the promise of maintaining a fixed price for life.

I could migrate to their new plans, which are about 50% more expensive per month, but it seems hard to justify trusting these people anymore. So I’ll probably migrate this server elsewhere once I do some research and find something more trustworthy than Joyent. The sad part is it probably won’t be Solaris ;-(

 

Bike to Work 2012

I meant to post this last month…

For years I’ve meant to participate in Bike to Work day. The distance for me isn’t that long (about 35 miles one way), but there is the small matter of having to cross the Santa Cruz Mountains to get from the Santa Cruz area over to Silicon Valley. And of course, doing it again at night to get back home!

This year I decided to go for it (I’m signed up for the Levi’s GranFondo in September, so I need to start getting some training mileage in!). Getting to work wasn’t bad at all: I’m used to climbing Mt. Charlie, and from the summit the rest of the way was all downhill or flat! Easy ride. I could do this often!

Coming back home was a lot tougher. When I got to Lexington Reservoir I had already done about 50 miles that day and was starting to get tired but all the climbing was still ahead of me! By the time I got to the summit I was beat and it was completely dark. Fortunately I had borrowed some powerful bike lights from my neighbor so I had plenty of illumination.

All in all it was fun, I should do this more often. Here is the data from my bike stats and also on strava.

 

Bloom filter vs. CPU cache

I was playing around with a bloom filter today and drawing some graphs on performance vs. various metrics. While for my real use case the filter bit array is larger, just for fun I wanted to look at how the performance changes as the array size exceeds various CPU caches.

I’m running on an AMD Phenom II X4 925 Processor which has 64K L1 cache (per core), 512K L2 cache (per core) and 6MB L3 cache (shared). The bloom filter code is single threaded.
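
For context, the loop being timed is essentially the following, shown here as a minimal single-threaded Java sketch rather than my actual filter code (the bit array size, number of hash probes and hash mixing constants are illustrative assumptions):

import java.util.Random;

// Minimal bloom filter timing sketch: a bit array plus k hash probes per
// insert, timing 10 million inserts for a given bit array size.
public class BloomTimingSketch {
    public static void main(String[] args) {
        long sizeBits = 8L * 1024 * 1024 * 8;  // 8MB bit array here; varied from run to run
        int k = 4;                             // hash probes per insert
        long[] bits = new long[(int) (sizeBits / 64)];
        Random rng = new Random(42);

        long start = System.nanoTime();
        for (int i = 0; i < 10_000_000; i++) {
            long key = rng.nextLong();
            // derive k probe positions from two hash values (double hashing)
            long h1 = key * 0x9E3779B97F4A7C15L;
            long h2 = (key ^ (key >>> 33)) * 0xC2B2AE3D27D4EB4FL;
            for (int j = 0; j < k; j++) {
                long pos = Math.floorMod(h1 + j * h2, sizeBits);
                bits[(int) (pos >>> 6)] |= 1L << (pos & 63);
            }
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println((sizeBits / 8) + " byte array: " + ms + " ms for 10M inserts");
    }
}

Repeating the run while growing the bit array is what produces the time curve discussed below.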

The following graph shows the time (yellow line) taken to insert 10 million entries into the bloom filter as the size of the bit array (red line) increases linearly (the two lines are not on the same y-axis scale). For the first two-thirds of the graph the time taken is just over 2 seconds, or just under 5 million elements per second. The sudden change in the slope of the line is near 512K, at which point the bit array no longer fits in the L2 cache.

The next graph zooms out to show the size of the bit array increasing all the way to 18MB. The second slope change (about a third of the way from the left) is in the neighborhood of 6MB, corresponding to the L3 cache size.

 

Bandwidth throttling with faban

I often use faban for performance related work. Nearly always I have used it while working on APIs which are called by other servers (as opposed to humans, who linger between mouse clicks) and where bandwidth use is not a significant factor (the processing time of the request outweighs the network transfer time by orders of magnitude). For those requirements it has always worked well to run the faban tests with zero think time and let them issue requests as fast as the server can handle.

Recently, however, I’ve been looking into a system where the request and/or response bodies are quite large, so the bulk of the total request time is consumed by the data transmission over the network. This creates a bit of a problem because in the lab the faban machine and the server being tested (“SUT”) are wired together via gigabit ethernet so there is a decent amount of bandwidth between them. While that sounds like a good problem to have, the reality is that in production the end users are coming in over the internet and have far lower bandwidth.

Thus, the testing is not very realistic. Faban can saturate the server with just a few users uploading at gigabit speeds, even though I know the server can handle far more users when each one is uploading at much slower speeds over the internet.

It turns out faban has the capability to throttle the upload and/or download bandwidth on a given socket. As far as I could find this is not documented anywhere; I found it by accident while looking at the code when I was considering various solutions.

Here’s one way (there may be other ways) to use it:

// ctx is the public driver context (com.sun.faban.driver.DriverContext);
// cast it to the engine implementation class to reach the throttling methods
DriverContext ctx = DriverContext.getContext();
com.sun.faban.driver.engine.DriverContext engine =
    (com.sun.faban.driver.engine.DriverContext) ctx;

// Set desired speed in KB per second, or -1 to disable throttling
engine.setUploadSpeed(uploadKBps);
engine.setDownloadSpeed(downloadKBps);

As of this writing the latest faban version is 1.0.2. In this version the upload throttling works fine but downloads (i.e. reading the response body) can hang if throttling is enabled. I filed a bug with a fix that is working reliably for me. If you try this with 1.0.2 (or earlier, probably) then you’ll need to apply that change and rebuild faban.

 

What is your cache hit rate?

While this may sound like an obvious metric to check, I often see that developers don’t verify the cache hit rate of their code under realistic conditions. The end result can be a server which performs worse than if it had no cache at all.

We all know the benefits of keeping a local cache: it is relatively cheap to keep and it saves having to make more expensive calls to obtain the data from wherever it ultimately resides. Just don’t forget that keeping that cache, while cheap, takes non-zero CPU and memory resources. The code must get more benefit from the cache than it spends maintaining it, otherwise the cache is a net loss.
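
Measuring the hit rate doesn’t have to be elaborate. Something like the following hypothetical sketch, a thin wrapper with hit/miss counters around whatever map or cache library is already in use, is enough to answer the question:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical sketch: wrap an existing cache with hit/miss counters so the
// hit rate can actually be measured under realistic load.
public class MeteredCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();

    public V get(K key, Function<K, V> loader) {
        V value = cache.get(key);
        if (value != null) {
            hits.incrementAndGet();
            return value;
        }
        misses.incrementAndGet();
        value = loader.apply(key);   // the expensive origin retrieval
        cache.put(key, value);
        return value;
    }

    // Expose this through logs, JMX or a stats endpoint so someone looks at it.
    public double hitRate() {
        long h = hits.get();
        long total = h + misses.get();
        return total == 0 ? 0.0 : (double) h / total;
    }
}

If the reported hit rate under realistic load is near zero, the cache is pure overhead.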

I was recently reviewing a RESTful service which kept a cache of everything it processed. The origin retrieval was relatively expensive so this seemed like a good idea. However, given the size of the objects being processed vs. the amount of RAM allocated to this service in production, my first question was what’s the cache hit rate?

Developers didn’t know, but felt that as long as it saves any back-end hit it must help, right?

A good rule of thumb is that anything that isn’t being measured is probably misbehaving… and this turned out to be no exception.

Testing under some load (I like faban for this) showed the server was allocating and quickly garbage collecting tens of thousands of buffer objects per second for the cache. Hit rate you ask? Zero!

Merely commenting out the cache gave a quick 10% boost in overall throughput.

So that’s my performance tip of the day: be aware of your cache hit rates!

The forgotten axis of scalability

I have written variants of this article before… I find it is such a recurring topic that maybe it is worth a revisit once again.

Back in the day, bumming instructions out of your assembly code was the thing to do to gain a few more CPU cycles here and there. It was very time consuming work, but computers were expensive and very, very slow, so the performance gains were worth it. It was great fun, but it hasn’t been cost effective for a couple of decades now.

In the 90’s, software performance became a forgotten art for the most part. With the MHz (later, GHz) wars in full swing, it was a given that CPUs would be twice as fast by the time you released the code, so why bother with any performance optimizations! As long as it was adequate on your development box, it would be plenty fast later.

In the 2000s the CPU frequency race was slowing down and Internet scale was speeding up. Up to a point you could buy bigger servers to keep up, but that was quite expensive and only got you so far. No matter how much budget you had, at some point faster servers were not going to cut it anymore, so you had to scale sideways instead. And thus the obvious conclusion was to skip the expensive server part altogether and scale horizontally on cheaper hardware from day one. Remember the buzz in the earlier part of the decade about Google having 10K servers? (Seems like such a small number now!)

It became a point of pride to have as many servers as possible and, once again, improving code performance was not seen as a good use of time when you can always throw another cheap box (or another hundred, who’s counting?) at the problem to compensate.

There’s nothing wrong with the basic premises of these trends. It was true that CPUs were getting faster all the time, and it is true that scaling horizontally on commodity hardware is the way to go. And it is also very true that intensive code optimization is hardly ever worth the effort and the opportunity cost of not doing something else.

(Back at Sun, in the Web Server team, we did spend a fair amount of time on such intense optimization work, looking for a few percent here and a few percent there. The goal was to be able to post world record SPECweb numbers (one example here). While fun, it was an exercise driven more by marketing than by the needs of data centers. For most platform vendors such an effort is not worth the cost. For companies offering services as opposed to products, it’s basically never worth the cost.)

The end result of all this, however, is that the concept of writing faster code and architecting for performance seems to have been lost! I’ve been seeing this for years now and if anything the trend is becoming more prevalent. The idea of scaling with more boxes from day one is so ingrained that I rarely see teams doing some basic performance sanity checking first.

More servers do cost more money. Not just to buy, but particularly to run, cool and house in a rack. If you can get by with a few fewer servers, that’s not a bad thing. If you can get by with a lot fewer servers, all those operating costs go straight to your profit margin. Not a bad deal!

If you read the popular book The Art of Scalability from a few years back, you’ll be familiar with its three axes of scalability (more boxes horizontally, split by service, shard by customer). I note with amusement that they forgot the easiest and cheapest axis of scalability, which is to write more performant code in the first place…

The usual argument goes that efficient design and code are not worth it because the gains, while real, are small enough that they are lost in the noise and you’ll still need about the same number of boxes anyway, so why bother? That’s usually true IF you’re starting from a reasonably optimized design and implementation. However, if the development team has not been running realistic load testing and performance analysis all along, I can pretty much guarantee there are gains to be had that will save you quite a bit in operating costs.

Enough philosophizing, how about a real world example…

When I started at my current position I took over one of the core production REST services. It was (and is) a very standard setup: REST APIs, Java Servlets, JAX-RS, MySQL. The usual. Response times were perfectly adequate, although not stellar. About a year ago response times started climbing as our user base kept growing every month. While the service was still doing fine, comparing the usage growth curve to the response time curve made it clear it was time to order some more hardware soon, to spread the load a bit before it slowed down enough that customers would notice.

Meanwhile, though, I had been working on sanitizing the performance. Long story short, I never did order more hardware. Just the opposite: after I upgraded the code, about 50% of the hardware dedicated to this service became available to reassign to other things. It simply wasn’t needed anymore given the increased capacity of the new code.

The new code can handle just about 40 times more throughput per server (not 40 percent, 40 times!). When fully loaded (at max capacity) it now maintains mean service times in the 8ms to 10ms range. The previous version had mean response times in the 100ms to 150ms range even though it was handling less than 1/40th of the load!

These may be commodity boxes, but buying 40 of them still takes some cash. And the monthly operating cost of 40 boxes is real money as well. Think about it, it means that roughly a rack full of 1U servers can be downsized to a single server…

I’d love to be able to boast about having done some extreme performance magic to get these scalability benefits, but the reality is all I did was some basic design and implementation optimizations across the board, grabbing the low-hanging performance gains here and there. Such gains add up and so by the time I was done the system could handle 40 times as much traffic (customer requests handled per second).

Why wouldn’t you do this level of performance sanity checking? It doesn’t take that much extra work to design and implement for scalability and the end result is a competitive advantage.

Rails Not Finding a File

I came across an interesting bug today… a Ruby on Rails application was consistently failing on a test machine, displaying an “uninitialized constant” error message. It was failing because it had not loaded an .rb file which contained the definition of the class being used. That seems straightforward enough, except that:

  • The file that was not being loaded was in a directory among other .rb files, all of which did get loaded, so why only this one file?
  • The application worked perfectly on the developer’s machine.
  • The application was also working on another test server.

The file was there in all the installations, with no access permission issues, and it was complete, readable and identical everywhere. So there was no issue with the file itself. Yet the runtime apparently was sometimes able to load it and sometimes not, depending on which machine it ran on.

The only thing different about this file was that its name was capitalized, whereas all the other .rb files there had all-lowercase names. Then I found out the developer was running the application on a Mac, while my test machine is a Linux box. As soon as I heard that, I remembered that the Mac HFS filesystem is a bit wacky about case: it preserves case, but by default it is not case-sensitive. The following surprising sequence actually works on a Mac:

% echo Hello > Hello
% cat hello
Hello

Problem solved, I thought! To confirm, I strace’d the application and indeed, it was opening “file.rb” even though the file on disk was called “File.rb”. That fails on a case-sensitive Linux filesystem but works on a Mac. That explains everything!

Except… the other test server that had been set up, the one where the application also works, is also a Linux box! How can it possibly be working there?

Running strace there showed that it was calling open() on “File.rb”, so it worked. But why?

After reviewing the strace output more closely, I noticed that on the working Linux box the process was open()ing the directory, reading the list of files, and then opening and reading each one in turn. Because it got the actual name of the file (“File.rb”), it was able to open it. On the Linux box where the application did not work, it never opened the directory; it went straight to attempting to open “file.rb”, which of course failed. That explains the discrepancy in the file name being opened, but why is one reading the directory and the other not? Both machines have identical installations of all the relevant software!

I then noticed that on the Linux box where it worked, the application was being run with the “production” environment flag of rails (-e). On the Linux box where it did not work, it wasn’t.

After some more digging, I discovered that production.rb sets:

config.cache_classes = true

cache_classes is documented as follows:

config.cache_classes controls whether or not application classes and modules should be reloaded on each request. Defaults to false in development mode, and true in test and production modes.

Aha! So it looks like the way this is implemented is that if cache_classes is true, Rails scans the application directories at startup and loads (and caches) all the .rb files it finds, which is what the strace output showed. Thus it finds and loads “File.rb”. If cache_classes is false, it never scans the directory; it simply attempts to (re)load “file.rb” on each request, which always fails.

With that, mystery truly solved! If you run into this, now you know…