The fastest web server is …

A few days ago I mentioned that I had started doing some static file performance runs on a handful of web servers. Here are the results!

Please refer to the previous article on setup details for info on exactly what and how I’m testing this time. Benchmark results apply to the narrow use case being tested and this one is no different.

The results are ordered from slowest to fastest peak numbers produced by each server.

9. Monkey (0.9.3-1 debian package)

The monkey server was not able to complete the runs so it is not included in the graphs. At a concurrency of 1, it finished the 30 minute run with an average of only 133 requests/second, far slower than any of the others. With only two concurrent clients it started erroring out on some requests so I stopped testing it. Looks like this one is not ready for prime time yet.

8. G-WAN (3.3.28)

I had seen some of the performance claims around G-WAN so decided to try it, even though it is not open source. All the tests I’d seen of it had been done running on localhost so I was curious to see how it behaves under a slightly more realistic type of load. Turns out, not too well. Aside from monkey, it was the slowest of the group.

7. Apache HTTPD, Event MPM (2.2.16-6+squeeze8 debian package)

I was surprised to see the event MPM do so badly. To be fair, it should do better against a benchmark which has large numbers of mostly idle clients which is not what this particular benchmark tested. At most points in the test it was also the highest consumer of CPU.

6. lighttpd (1.4.28-2+squeeze1 debian package)

Here we get to the first of the serious players (for this test scenario). lighttpd starts out strong up to 3 concurrent clients. After that it stops scaling up so it loses some ground in the final results. Also, lighttpd is the lightest user of CPU of the group.

5. nginx (0.7.67-3+squeeze2 debian package)

The nginx throughput curve is just about identical to lighttpd, just shifted slightly higher. The CPU consumption curve is also almost identical. These two are twins separated at birth. While nginx uses a tiny bit more CPU than lighttpd, it makes up for it with higher throughput.

4. cherokee (1.0.8-5+squeeze1 debian package)

Cherokee just barely edges out nginx at the higher concurrencies tested so it ends up fourth. To be fair, nginx was faster than cherokee at most of the lower concurrencies though. Note, however, that cherokee uses quite a bit more CPU to deliver its numbers so it is not as efficient as nginx.

3. Apache HTTPD, Worker MPM (2.2.16-6+squeeze8 debian package)

Apache third, really? Yes, but only because this ranking is based on the peak numbers of each server. With worker mpm, apache starts out quite a bit behind lighttpd/nginx/cherokee at lower client concurrencies. However, as those others start to stall as concurrency increases, apache keeps going higher. Around five concurrent clients it catches up to lighttpd and around eight clients it catches up to nginx and cherokee. At ten it scores a throughput just slightly above those two, securing third place in this test. Looking at CPU usage though, at that point it has just about maxed out the CPU (about 1% idle), making it the highest CPU consumer of this group, so it is not very efficient.

2. varnish (2.1.3-8 debian package)

Varnish is not really a web server, of course, so in that sense it is out of place in this test. But it can serve (cached) static files and has been included in other similar performance tests so I decided to include it here.

Varnish throughput starts out quite a bit slower than nginx, right on par with lighttpd and cherokee at lower concurrencies. However, varnish scales up beautifully. Unlike all the previous servers, its throughput curve does not flatten out as concurrency increases in this test; it keeps going higher. Around four concurrent users it surpasses nginx and only keeps going higher all the way to ten.

Varnish was able to push network utilization to 90-94%. The only drawback is that delivering its performance does use up a lot of CPU… only Apache used more CPU than varnish in this test. At ten clients, there is only 9% idle CPU left.

1. heliod (0.2)

heliod had the highest throughput at every point tested in these runs. It is slightly faster than nginx at sequential requests (one client) and then pulls away.

heliod is also quite efficient in CPU consumption. Up to four concurrent clients it is the lightest user of CPU cycles even though it produced higher throughput than all the others. At higher concurrencies, it used slightly more CPU than nginx/lighttpd although it makes up for it with far higher throughput.

heliod was also the only server able to saturate the gigabit connection (at over 97% utilization). Given that there is 62% idle CPU left at that point, I suspect if I had more bandwidth heliod might be able to score even higher on this machine.

These results should not be much of a surprise… after all heliod is not new, it is the same code that has been setting benchmark records for over ten years (it just wasn’t open source back then). Fast then, still fast today.

If you are running one of these web servers and using varnish to accelerate it, you could switch to heliod by itself and both simplify your setup and gain performance at the same time. Food for thought!


All right, let’s see some graphs..

First, here is the overall throughput graph for all the servers tested:

As you can see the servers fall into three groups in terms of throughput:

  1. apache-event and g-wan are not competitive in this crowd
  2. apache-worker/nginx/lighttpd/cherokee are quite similar in the middle
  3. varnish and heliod are in a class of their own at the high end

Next graph shows the 90th percentile response time for each server. That is, 90 percent of all requests completed in this time or less. I left out apache-event and g-wan from the graph to avoid compressing the more interesting part of the graph:
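For reference, a 90th percentile like the one described here can be computed from raw response times with the nearest-rank method. This is only a minimal sketch: faban produces these figures itself, and both the `Percentile` class and the sample values below are made up for illustration.

```java
import java.util.Arrays;

final class Percentile {
    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static double nearestRank(double[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // 1-based rank of the sample that covers p percent of the data
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Hypothetical response times in milliseconds, not measured data.
        double[] millis = {12, 3, 5, 4, 9, 4, 6, 3, 5, 40};
        System.out.println(nearestRank(millis, 90)); // prints 12.0
    }
}
```

Note how the single 40 ms outlier does not affect the result, which is why a percentile is often more telling than a mean for response times.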

The next graph shows CPU idle time (percent) for each server through the run. The spikes to 100% between each step are due to the short idle interval between each run as faban starts the next run.

The two apache variants (red and orange) are the only ones that maxed out the CPU. Varnish (light green) also uses quite a bit of CPU and comes close (9% idle). On the other side, lighttpd (dark red) and nginx (light blue) put the least load on the CPU with about 72% idle.

Finally, the next graph shows network utilization percentage of the gigabit interface:

Here heliod (blue) is the only one which manages to saturate the network, with varnish coming in quite close. None of the others manage to reach even 60% utilization.
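To put the utilization percentages in perspective, here is a back-of-the-envelope sketch relating requests per second to utilization of a gigabit link for 8K files. It counts file payload only and ignores HTTP headers and TCP/IP framing, so real request rates at a given utilization are somewhat lower. The `GigabitCeiling` class is purely illustrative and is not part of the test harness.

```java
final class GigabitCeiling {
    static final double LINK_BITS_PER_SEC = 1_000_000_000.0; // gigabit ethernet
    static final int FILE_BYTES = 8 * 1024;                  // 8K static files

    /** Payload-only requests/second at the given link utilization (0.0 to 1.0). */
    static double requestsPerSecond(double utilization) {
        return utilization * LINK_BITS_PER_SEC / (FILE_BYTES * 8.0);
    }

    /** Payload-only link utilization (0.0 to 1.0) at the given request rate. */
    static double utilization(double requestsPerSecond) {
        return requestsPerSecond * FILE_BYTES * 8.0 / LINK_BITS_PER_SEC;
    }

    public static void main(String[] args) {
        // Upper bound on 8K responses per second over a fully used link.
        System.out.printf("ceiling: %.1f requests/second%n", requestsPerSecond(1.0));
        // The same bound at the 97% utilization mentioned for heliod.
        System.out.printf("at 97%%: %.1f requests/second%n", requestsPerSecond(0.97));
    }
}
```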

So there you have it… heliod can sustain far higher throughput than any of the popular web servers in this static file test and it can do so efficiently, saturating the network on a low power two core machine while leaving plenty of CPU idle. It even manages to sustain higher throughput than varnish which specializes in caching static content efficiently and is not a full featured web server.

Of course, all benchmarks are by necessity artificial. If any of the variables change the numbers will change and the rankings may change. These results are representative of the exact use case and setup I tested, not necessarily of any other. Again, for details on what and how I tested, see my previous article.

I hope to test other scenarios in the future. I’d love to also test on a faster CPU with lots of cores; unfortunately I don’t own such hardware so it is unlikely to happen.

Finally, I set up a github repository fabhttp which contains:

  1. source code of the faban driver used to run these tests
  2. dstat/nicstat data collected during the runs (used to generate the graphs above)
  3. additional graphs generated by faban for every individual run

 

Web Server Performance Testing

I started to run some performance tests on heliod to compare it with a handful of other web servers out there. I’ll publish results in the next article once the runs complete.

Now, heliod shines brightest on large systems with multiple CPUs and many cores. Unfortunately I don’t own such hardware so I’m testing on a very low-end system I have available, but it should still be interesting.

One of the many challenges of benchmarking is deciding what to test. All performance tests are to some extent artificial; ultimately all that matters is how it works in production with the actual customer traffic over the production network.

For these runs I chose to measure static file performance with files of size 8K.

One of my pet peeves is articles which show someone’s performance run results without any details as to how they were measured. So to avoid doing that myself, below are all the details on how I’m running these tests. In addition, I will publish the driver source and data set so you can run the same tests on your machine if you like.

Client

A trustworthy load generator client is vital for performance testing, something too often overlooked (I can’t believe anyone is still using ‘ab’, for instance!). If the client can’t scale or in other ways introduces limitations the resulting numbers will be meaningless because they reflect the limits of the client not those of the server being tested.

I’m using faban for these tests.

I’m running faban on my Open Indiana machine which has an AMD Phenom II X4 (quad core) 925 CPU and 8GB RAM.

Server

I’m running the various web servers under test on a small Linux box. It would be fun to test on a server with lots of cores but this is what I have available:

  • CPU: Intel Atom D510 (1.66GHz, 2 cores, 4 hardware threads)
  • RAM: 4GB
  • OS: Debian 6.0.6 32bit

Network

Both machines have gigabit ethernet and are connected to the same gigabit switch.

As an aside, there seems to be a trend of testing servers via localhost (that is, with the load generator on the same machine connecting to http://localhost). Doing so is a mistake that will report meaningless numbers. Your customers won’t be connecting to the server on localhost, so your load generator shouldn’t either.

Software Configuration

Another important decision for benchmarking is how to tune the software stack. For this round of runs I am running out-of-the-box default configurations for everything. This means the results will not be optimized. I’m sure most, if not all, the web servers being tested could score higher if their configuration is tuned to the very specific requirements of this particular test and hardware environment. Why test default configurations? A few reasons:

  • Baseline: I expect I’ll run more optimized configurations later on, so it is nice to have a baseline to compare how much future tuning helps.
  • Reality: Truth is, lots of servers do get deployed into production with default configurations. So while the results are not optimal, they are relevant to everyone that has not taken the time to tune their server configuration.
  • Fairness: I am familiar with performance tuning only a couple of the web servers I’m testing here. If I tune those well and the other ones badly I’ll influence the results in favor of the ones I know. So to be fair, I won’t tune any of them.

Software Versions

Whenever possible, I installed the Debian package for each server.

  • apache2-mpm-event                2.2.16-6+squeeze8
  • apache2-mpm-worker              2.2.16-6+squeeze8
  • lighttpd                                     1.4.28-2+squeeze1
  • nginx                                         0.7.67-3+squeeze2
  • cherokee                                  1.0.8-5+squeeze1
  • monkey                                    0.9.3-1
  • varnish                                      2.1.3-8
  • g-wan from http://gwan.com/archives/gwan_linux32-bit.tar.bz2
  • heliod 0.2 from http://sourceforge.net/projects/heliod/files/release-0.2/

Operation

The server has 1000 files, each file is different and each one is 8K in size. The faban driver requests a random file out of the 1000 each time.
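A data set of this shape is easy to recreate. The sketch below writes 1000 files of 8K each, filled with pseudo-random bytes so that the contents differ in practice, and picks one at random per request. It is not the published data set: the `DataSet` class, the `file0000.bin` naming scheme and the seed are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

final class DataSet {
    static final int FILE_COUNT = 1000;
    static final int FILE_SIZE = 8 * 1024;

    /** Writes FILE_COUNT files of FILE_SIZE bytes each into dir. */
    static void generate(Path dir, long seed) {
        try {
            Files.createDirectories(dir);
            Random random = new Random(seed);
            byte[] buffer = new byte[FILE_SIZE];
            for (int i = 0; i < FILE_COUNT; i++) {
                random.nextBytes(buffer); // fresh pseudo-random content per file
                Files.write(dir.resolve(fileName(i)), buffer);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical naming scheme: file0000.bin through file0999.bin. */
    static String fileName(int index) {
        return String.format("file%04d.bin", index);
    }

    /** Picks the file for the next request uniformly at random. */
    static String randomFileName(Random random) {
        return fileName(random.nextInt(FILE_COUNT));
    }
}
```

With 1000 files of 8K the whole set is only about 8MB, so it fits comfortably in the page cache of the 4GB server and the test exercises the server rather than the disk.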

There is no client think time between requests. That is, each client thread will load up the server as fast as the server can respond.

For each server I am doing ten runs, starting with 1 concurrent client (sequential requests) up to 10 concurrent clients. Keep in mind the server only has 1 CPU with 4 hardware threads in two cores, so one should expect the throughput to scale up from 1 to 4 concurrent clients and start to taper off after that.

Each run is 30 minutes long. This allows a bit of time to see if throughput remains consistent for a given load level. Each run is preceded by a one minute warmup time.

heliod and CRIME

CRIME is an interesting approach to leaking information protected by SSL/TLS. It is an easy to understand/explain example of why security issues are nearly always more complex than they seem!

heliod uses NSS, which by default has TLS compression disabled so it is not vulnerable.

Web Server

If you were to look at the HTTP response headers from this site, you’d see it is being handled by:

Server: heliod-web-server/0.1

Which is a web server you’ve probably never heard of before… Or I should say, you most likely have, but with various different names.

Way back when, this was the Netscape Enterprise Server. Which later became iPlanet Web Server (during the Sun|Netscape alliance). Under Sun alone, it was renamed several times to SunONE Web Server and Sun Java System Web Server (and maybe some other name variants I forget now). Naming nonsense aside, it’s been the same evolving code base all along, best known for high performance and even higher scalability.

Thankfully, Sun open sourced the code in 2009 under the BSD license. Most of it, anyway. Unfortunately a few parts were left out, mainly the administration support, installer and the embedded Java servlet engine. The open source code was kept in sync with the commercial releases until January 2010 (7.0 update 8, using the commercial release version numbering). After that, the open source repository has not seen any activity (not coincidentally, January 2010 was also when Oracle acquired Sun, so this is not surprising).

Surprisingly, the source repository is still available:

hg clone ssh://anon@hg.opensolaris.org/hg/webstack/webserver

The source as published can be tricky to build and it does not produce an installable package. When I was setting up this site last year I ended up forking this code into http://sourceforge.net/projects/heliod/. The code is the same but I added a rudimentary install script to make it easier to get going. You can download binaries for Solaris (x86) and Linux from the sourceforge page so you don’t have to build it yourself if you prefer not to.

(Update: The source is now in github here: https://github.com/jvirkki/heliod)

 

Joyent Debacle

I’ve been hosting this server on Joyent for a while now for a few reasons. One was that their VMs are Solaris zones: (a) that is cool, (b) I prefer hosting Internet-facing servers on Solaris and (c) some of the Sun talent moved to Joyent after the Oracle disaster so I liked the idea of supporting Joyent. The other reason was that Joyent offered a fixed-price-for-life server when I signed up so it was a nice deal as well.

Yesterday Joyent broke their promises by dropping all the fixed & lifetime plans out of the blue. There’s been coverage on Slashdot, Network World, ZDNet and plenty other places. Discussion rages on at the support forum and there is a google group dedicated to finding alternatives as well.

The people who prepaid for a lifetime plan were the hardest hit. For me it is not that bad as I was on a monthly plan (not prepaid) but they did still break the promise of maintaining a fixed price for life.

I could migrate to their new plans, which are about 50% more expensive per month but it seems hard to justify why I should trust these people anymore. So I’ll probably migrate this server elsewhere once I do some research to find something more trustworthy than Joyent. The sad part is it probably won’t be Solaris ;-(

 

Bike to Work 2012

I meant to post this last month…

For years I’ve meant to participate in Bike to Work day. That distance for me isn’t that long (about 35 miles one way) but there is the small matter of having to cross the Santa Cruz Mountains to get from the Santa Cruz area over to Silicon Valley. And of course, doing it again at night to get back home!

This year I decided to go for it (I’m signed up for the Levi’s GranFondo in September, so need to start getting some training mileage in!). Going to work wasn’t bad at all, I’m used to climbing Mt. Charlie and from the summit the rest of the way was all down hill or flat! Easy ride. I could do this often!

Coming back home was a lot tougher. When I got to Lexington Reservoir I had already done about 50 miles that day and was starting to get tired but all the climbing was still ahead of me! By the time I got to the summit I was beat and it was completely dark. Fortunately I had borrowed some powerful bike lights from my neighbor so I had plenty of illumination.

All in all it was fun, I should do this more often. Here is the data from my bike stats and also on strava.

 

Bloom filter vs. CPU cache

I was playing around with a bloom filter today and drawing some graphs on performance vs. various metrics. While for my real use case the filter bit array is larger, just for fun I wanted to look at how the performance changes as the array size exceeds various CPU caches.

I’m running on an AMD Phenom II X4 925 Processor which has 64K L1 cache (per core), 512K L2 cache (per core) and 6MB L3 cache (shared). The bloom filter code is single threaded.

The following graph shows the time (yellow line) taken to insert 10 million entries into the bloom filter as the size of the bit array (red line) increases linearly (the two lines are not on the same y-axis scale). For the first two-thirds of the graph the time taken is just over 2 seconds, or just under 5 million elements per second. The sudden change in the slope of the line is near 512K at which point the array no longer fits the L2 cache.

The next graph zooms out to show the size of the bit array increasing all the way to 18MB. The second slope change (about a third of the way from the left)  is in the neighborhood of 6MB, corresponding to the L3 cache size.
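The experiment is simple to approximate. The code behind these graphs is not shown here, so the following is only a sketch under assumptions: a `BitSet`-backed filter with four probes per key derived by double hashing from the MurmurHash3 64-bit finalizer, timed while inserting 10 million keys at several bit array sizes. Expect noisy numbers from a quick run like this, since JIT warm-up and other processes on the machine also affect the timings.

```java
import java.util.BitSet;

final class BloomSketch {
    private final BitSet bits;
    private final int bitCount;
    private final int hashCount;

    BloomSketch(int bitCount, int hashCount) {
        this.bits = new BitSet(bitCount);
        this.bitCount = bitCount;
        this.hashCount = hashCount;
    }

    void add(long key) {
        long h1 = mix(key);
        long h2 = mix(h1) | 1L; // second hash, forced odd so it is never zero
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(h1 + i * h2));
        }
    }

    boolean mightContain(long key) {
        long h1 = mix(key);
        long h2 = mix(h1) | 1L;
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(h1 + i * h2))) {
                return false; // a clear bit means the key was never added
            }
        }
        return true; // all bits set: present, or a false positive
    }

    private int index(long hash) {
        return (int) Long.remainderUnsigned(hash, bitCount);
    }

    /** 64-bit finalizer from MurmurHash3, used here as a cheap mixer. */
    private static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }

    /** Times inserting count sequential keys into a filter with bitCount bits. */
    static long timeInsertsNanos(int bitCount, int count) {
        BloomSketch filter = new BloomSketch(bitCount, 4);
        long start = System.nanoTime();
        for (long key = 0; key < count; key++) {
            filter.add(key);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Bit array sizes from 128K to 16MB, straddling the L2 and L3 cache sizes.
        for (int kb = 128; kb <= 16 * 1024; kb *= 2) {
            long nanos = timeInsertsNanos(kb * 1024 * 8, 10_000_000);
            System.out.printf("%6d KB: %d ms%n", kb, nanos / 1_000_000);
        }
    }
}
```

Because the hashed probes land at effectively random positions in the bit array, each insert touches several unrelated cache lines, which is why the insert time is sensitive to whether the whole array fits in a given cache level.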

 

Bandwidth throttling with faban

I often use faban for performance related work. Nearly always I have used it while working on APIs which are called by other servers (as opposed to humans, who linger between mouse clicks) and where bandwidth use is not a significant factor (the processing time of the request outweighs the request/response time by orders of magnitude). For these requirements it has always worked well to run the faban tests with zero think time and letting it issue requests as fast as the server can handle.

Recently, however, I’ve been looking into a system where the request and/or response bodies are quite large, so the bulk of the total request time is consumed by the data transmission over the network. This creates a bit of a problem because in the lab the faban machine and the server being tested (“SUT”) are wired together via gigabit ethernet so there is a decent amount of bandwidth between them. While that sounds like a good problem to have, the reality is that in production the end users are coming in over the internet and have far lower bandwidth.

Thus, the testing is not very realistic. Faban can saturate the server with just a few users uploading at gigabit speeds even though I know the server can handle far more users when each one is uploading at much slower speeds over the internet.
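A quick calculation shows how large the gap can be. This is a rough sketch that ignores protocol overhead, and the 5 Mbit/s per-user uplink is a hypothetical figure chosen for illustration, not a measurement from this system.

```java
final class LinkMath {
    /** Number of concurrent transfers needed to fill the link at the given per-user rate. */
    static long usersToSaturate(long linkBitsPerSec, long perUserBitsPerSec) {
        // Ceiling division: a partially used slot still needs one more user.
        return (linkBitsPerSec + perUserBitsPerSec - 1) / perUserBitsPerSec;
    }

    public static void main(String[] args) {
        long gigabit = 1_000_000_000L;
        // Hypothetical internet user uploading at 5 Mbit/s.
        System.out.println(usersToSaturate(gigabit, 5_000_000L)); // prints 200
        // An unthrottled lab client can fill the whole link by itself.
        System.out.println(usersToSaturate(gigabit, gigabit));    // prints 1
    }
}
```

In other words, one unthrottled lab client can consume the bandwidth that would be spread across a couple hundred real users under these assumptions, so the server sees far fewer concurrent connections than it would in production.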

Turns out faban has the capability to throttle the upload and/or download bandwidth over a given socket. As far as I could find this is not documented anywhere; I found it by accident while looking at the code when I was considering various solutions.

Here’s one way (there may be other ways) to use it:

// Get the public driver context, then cast it to the engine class
// which has the throttling setters
DriverContext ctx = DriverContext.getContext();
com.sun.faban.driver.engine.DriverContext engine =
    (com.sun.faban.driver.engine.DriverContext)ctx;

// Set desired speed in K per second, or -1 to disable throttling
engine.setUploadSpeed(uploadKBps);
engine.setDownloadSpeed(downloadKBps);

As of this writing the latest faban version is 1.0.2. In this version the upload throttling works fine but downloads (i.e. reading the response body) can hang if throttling is enabled. I filed a bug with a fix that is working reliably for me. If you try this with 1.0.2 (or earlier, probably) then you’ll need to apply that change and rebuild faban.

 

What is your cache hit rate?

While this may sound like an obvious metric to check, I’m often seeing that developers don’t verify the cache hit rate on their code under realistic conditions. The end result is a server which performs worse than if it had no cache at all.

We all know the benefits of keeping a local cache.. relatively cheap to keep and it saves having to make more expensive calls to obtain the data from wherever it ultimately resides. Just don’t forget that keeping that cache, while cheap, takes non-zero CPU and memory resources. The code must get more benefit from it than the cost of maintaining the cache, otherwise it is a net loss.

I was recently reviewing a RESTful service which kept a cache of everything it processed. The origin retrieval was relatively expensive so this seemed like a good idea. However, given the size of the objects being processed vs. the amount of RAM allocated to this service in production, my first question was what’s the cache hit rate?

Developers didn’t know, but felt that as long as it saves any back-end hit it must help, right?

A good rule of thumb is that anything that isn’t being measured is probably misbehaving… and this turned out to be no exception.

Testing under some load (I like faban for this) showed the server was allocating and quickly garbage collecting tens of thousands of buffer objects per second for the cache. Hit rate you ask? Zero!

Merely commenting out the cache gave a quick 10% boost in overall throughput.

So that’s my performance tip of the day.. be aware of your cache hit rates!
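Measuring the hit rate takes very little code. Here is a minimal sketch, assuming a simple LRU cache built on `LinkedHashMap`; the `CountingCache` class and its loader function are hypothetical and are not the service described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class CountingCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader;
    private long hits;
    private long misses;

    CountingCache(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    synchronized V get(K key) {
        V value = entries.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = loader.apply(key); // stands in for the expensive origin retrieval
        entries.put(key, value);
        return value;
    }

    /** Fraction of lookups served from the cache, from 0.0 to 1.0. */
    synchronized double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        // Working set of 3 keys cycling through a cache that holds only 2.
        CountingCache<Integer, Integer> cache = new CountingCache<>(2, k -> k * 2);
        for (int i = 0; i < 300; i++) {
            cache.get(i % 3);
        }
        System.out.println(cache.hitRate()); // prints 0.0
    }
}
```

The usage example shows the failure mode in miniature: when the working set is even slightly larger than the cache, an LRU policy can evict every entry just before it is needed again, so all the bookkeeping is paid for and nothing is gained.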

The forgotten axis of scalability

I have written variants of this article before… I find it is such a recurring topic that maybe it is worth a revisit once again.

Back in the day, bumming instructions out of your assembly code was the thing to do to gain a few more CPU cycles here and there. It was very time consuming work but computers were expensive and very, very slow so the performance gains were worth it. It was great fun, but it hasn’t been cost effective for a couple decades now.

In the 90’s, software performance became a forgotten art for the most part. With the MHz (later, GHz) wars in full swing, it was a given that CPUs would be twice as fast by the time you released the code, so why bother with any performance optimizations! As long as it was adequate on your development box, it would be plenty fast later.

In the 2000’s the CPU frequency race was slowing down and Internet scale was speeding up. Up to a point you could buy bigger servers to keep up but that was quite expensive and only got you so far. No matter how much budget you had, at some point faster servers were not going to cut it anymore so you had to scale sideways instead. And thus, the obvious conclusion was to skip the expensive server part altogether and scale horizontally on cheaper hardware from day one. Remember the buzz in the earlier part of the decade about google having 10K servers? (Seems like such a small number now!)

It became a point of pride to have as many servers as possible and, once again, improving code performance was not seen as a good use of time when you can always throw another cheap box (or another hundred, who’s counting?) at the problem to compensate.

There’s nothing to argue with the basic premises of these trends. It was true that CPUs were getting faster all the time and it is true that scaling horizontally on commodity hardware is the way to go. And it is also very true that intensive code optimization is hardly ever worth the effort and opportunity cost of not doing something else.

(Back at Sun in the Web Server team we did spend a fair amount of time on such intense optimization work, looking for a few percent here and a few percent there. The goal was to be able to post world record SPECweb numbers (one example here). While fun, it was an exercise driven by marketing, not so much by the needs of data centers. For most platform vendors, such an effort is not worth the cost. For companies offering services as opposed to products, it’s basically never worth the cost.)

The end result of all this, however, is that the concept of writing faster code and architecting for performance seems to have been lost! I’ve been seeing this for years now and if anything the trend is becoming more prevalent. The idea of scaling with more boxes from day one is so ingrained that I rarely see teams doing some basic performance sanity checking first.

More servers do cost more money. Not just to buy, but particularly to run, cool and house in a rack. If you can get by with a few fewer servers, that’s not a bad thing. If you can get by with a lot fewer servers, all those operating costs go straight to your profit margin. Not a bad deal!

If you read the popular book Art of Scalability from a few years back you’ll be familiar with their three axes of scalability (more boxes horizontally, split by service, shard by customer). I note with amusement they forgot the easiest and cheapest axis of scalability, which is to write more performant code in the first place…

The usual argument goes that efficient design and code is not worth it because the gains, while real, are small enough that they are lost in the noise and you’ll still need about the same number of boxes anyway so why bother? That’s usually true IF you’re starting from a reasonably optimized design and implementation. However, if the development team has not been running realistic load testing and performance analysis all along, I can pretty much guarantee there are gains to be had that’ll save you quite a bit in operating costs.

Enough philosophizing, how about a real world example…

When I started at my current position I took over one of the core production REST services. It was (and is) a very standard setup… REST APIs, Java Servlets, JAX-RS, MySQL. The usual. Response times were plenty adequate although not stellar. About a year ago response times started climbing as our user base kept growing every month. While it was still doing fine, comparing the usage growth curves to the response time curves made it clear it was time to order some more hardware soon to spread the load a bit before it slowed down enough that customers would notice.

Meanwhile though, I had been working on sanitizing the performance. Long story short, I never did order more hardware. Just the opposite… after I upgraded the code, about 50% of the hardware dedicated to this service became available to reassign to other things, it simply wasn’t needed anymore given the increased capacity of the new code.

The new code can handle just about 40 times more throughput per server (not 40 percent, 40 times!). When fully loaded (at max capacity) it now maintains mean service times in the 8ms to 10ms range. The previous version had mean response times in the 100ms to 150ms range even though it was handling less than 1/40th of the load!

These may be commodity boxes, but buying 40 of them still takes some cash. And the monthly operating cost of 40 boxes is real money as well. Think about it, it means that roughly a rack full of 1U servers can be downsized to a single server…

I’d love to be able to boast about having done some extreme performance magic to get these scalability benefits, but the reality is all I did was some basic design and implementation optimizations across the board, grabbing the low-hanging performance gains here and there. Such gains add up and so by the time I was done the system could handle 40 times as much traffic (customer requests handled per second).

Why wouldn’t you do this level of performance sanity checking? It doesn’t take that much extra work to design and implement for scalability and the end result is a competitive advantage.