The fastest web server is …

A few days ago I mentioned that I had started doing some static file performance runs on a handful of web servers. Here are the results!

Please refer to the previous article for details on exactly what and how I’m testing this time. As always, benchmark results apply only to the narrow use case being tested, and this one is no different.

The results are ordered from slowest to fastest, based on the peak throughput each server produced.

9. Monkey (0.9.3-1 debian package)

The monkey server was not able to complete the runs so it is not included in the graphs. At a concurrency of 1, it finished the 30 minute run with an average of only 133 requests/second, far slower than any of the others. With only two concurrent clients it started erroring out on some requests so I stopped testing it. Looks like this one is not ready for prime time yet.

8. G-WAN (3.3.28)

I had seen some of the performance claims around G-WAN, so I decided to try it even though it is not open source. All the tests I’d seen of it had been run on localhost, so I was curious to see how it behaves under a slightly more realistic kind of load. Turns out, not too well. Aside from monkey, it was the slowest of the group.

7. Apache HTTPD, Event MPM (2.2.16-6+squeeze8 debian package)

I was surprised to see the event MPM do so badly. To be fair, it should do better against a benchmark with large numbers of mostly idle clients, which is not what this particular benchmark tested. It was also the highest consumer of CPU at most points in the test.

6. lighttpd (1.4.28-2+squeeze1 debian package)

Here we get to the first of the serious players (for this test scenario). lighttpd starts out strong up to 3 concurrent clients. After that it stops scaling up, so it loses some ground in the final results. Also, lighttpd is the lightest user of CPU of the group.

5. nginx (0.7.67-3+squeeze2 debian package)

The nginx throughput curve is just about identical to lighttpd’s, only shifted slightly higher. The CPU consumption curve is also almost identical. These two are twins separated at birth. While nginx uses a tiny bit more CPU than lighttpd, it makes up for it with higher throughput.

4. cherokee (1.0.8-5+squeeze1 debian package)

Cherokee just barely edges out nginx at the higher concurrencies tested, so it ends up fourth. To be fair, nginx was faster than cherokee at most of the lower concurrencies. Note, however, that cherokee uses quite a bit more CPU to deliver its numbers, so it is not as efficient as nginx.

3. Apache HTTPD, Worker MPM (2.2.16-6+squeeze8 debian package)

Apache third, really? Yes, but only because this ranking is based on the peak numbers of each server. With the worker MPM, apache starts out quite a bit behind lighttpd/nginx/cherokee at lower client concurrencies. However, as those others start to stall as concurrency increases, apache keeps going higher. Around five concurrent clients it catches up to lighttpd, and around eight clients it catches up to nginx and cherokee. At ten it scores a throughput just slightly above those two, securing third place in this test. Looking at CPU usage though, at that point it has just about maxed out the CPU (about 1% idle), making it the highest CPU consumer of this group, so it is not very efficient.

2. varnish (2.1.3-8 debian package)

Varnish is not really a web server, of course, so in that sense it is out of place in this test. But it can serve (cached) static files and has been included in other similar performance tests so I decided to include it here.

Varnish throughput starts out quite a bit slower than nginx, right on par with lighttpd and cherokee at the lower concurrencies. However, varnish scales up beautifully. Unlike all the previous servers, its throughput curve does not flatten out as concurrency increases in this test; it keeps going higher. Around four concurrent users it surpasses nginx and keeps climbing all the way to ten.

Varnish was able to push network utilization to 90-94%. The only drawback is that delivering its performance does use up a lot of CPU… only Apache used more CPU than varnish in this test. At ten clients, there is only 9% idle CPU left.

1. heliod (0.2)

heliod had the highest throughput at every point tested in these runs. It is slightly faster than nginx at sequential requests (one client) and then pulls away.

heliod is also quite efficient in CPU consumption. Up to four concurrent clients it is the lightest user of CPU cycles even though it produced higher throughput than all the others. At higher concurrencies, it used slightly more CPU than nginx/lighttpd although it makes up for it with far higher throughput.

heliod was also the only server able to saturate the gigabit connection (at over 97% utilization). Given that there is 62% idle CPU left at that point, I suspect if I had more bandwidth heliod might be able to score even higher on this machine.
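
As a rough back-of-envelope check on what saturation means here (my own arithmetic, not a measured figure): a gigabit link carries about 125 MB/s, so 97% utilization is roughly 121 MB/s. Ignoring HTTP headers and TCP/IP overhead, dividing that by the 8K (8192 byte) file size gives an upper bound on the order of 14,000-15,000 requests per second; the real request rate is somewhat lower once that overhead is counted.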

These results should not be much of a surprise… after all, heliod is not new; it is the same code that has been setting benchmark records for over ten years (it just wasn’t open source back then). Fast then, still fast today.

If you are running one of these web servers and using varnish to accelerate it, you could switch to heliod by itself, simplifying your setup and gaining performance at the same time. Food for thought!


All right, let’s see some graphs…

First, here is the overall throughput graph for all the servers tested:

As you can see the servers fall into three groups in terms of throughput:

  1. apache-event and g-wan are not competitive in this crowd
  2. apache-worker/nginx/lighttpd/cherokee are quite similar in the middle
  3. varnish and heliod are in a class of their own at the high end

The next graph shows the 90th percentile response time for each server. That is, 90 percent of all requests completed in this time or less. I left out apache-event and g-wan to avoid compressing the more interesting part of the graph:

The next graph shows CPU idle time (percent) for each server through the run. The spikes to 100% between steps are due to the short idle interval while faban starts the next run.

The two apache variants (red and orange) are the only ones that maxed out the CPU. Varnish (light green) also uses quite a bit of CPU and comes close (9% idle). At the other end, lighttpd (dark red) and nginx (light blue) put the least load on the CPU, with about 72% idle.

Finally, the next graph shows network utilization percentage of the gigabit interface:

Here heliod (blue) is the only one which manages to saturate the network, with varnish coming in quite close. None of the others manage to reach even 60% utilization.

So there you have it… heliod can sustain far higher throughput than any of the popular web servers in this static file test, and it can do so efficiently, saturating the network on a low-power, two-core machine while leaving plenty of CPU idle. It even manages to sustain higher throughput than varnish, which specializes in serving cached static content efficiently and is not a full-featured web server.

Of course, all benchmarks are by necessity artificial. If any of the variables change, the numbers will change and the rankings may change as well. These results are representative of the exact use case and setup I tested, not necessarily of any other. Again, for details on what and how I tested, see my previous article.

I hope to test other scenarios in the future. I’d also love to test on a faster CPU with lots of cores; unfortunately, I don’t own such hardware, so it is unlikely to happen.

Finally, I set up a GitHub repository, fabhttp, which contains:

  1. source code of the faban driver used to run these tests
  2. dstat/nicstat data collected during the runs (used to generate the graphs above)
  3. additional graphs generated by faban for every individual run


Web Server Performance Testing

I started to run some performance tests on heliod to compare it with a handful of other web servers out there. I’ll publish results in the next article once the runs complete.

Now, heliod shines brightest on large systems with multiple CPUs and many cores. Unfortunately I don’t own such hardware, so I’m testing on a very low-end system I have available, but it should still be interesting.

One of the many challenges of benchmarking is deciding what to test. All performance tests are to some extent artificial; ultimately, all that matters is how the server performs in production with the actual customer traffic over the production network.

For these runs I chose to measure static file performance with files of size 8K.

One of my pet peeves is articles that show someone’s performance run results without any details as to how they were measured. So, to avoid doing that myself, below are all the details on how I’m running these tests. In addition, I will publish the driver source and data set so you can run the same tests on your own machine if you like.

Client

A trustworthy load generator client is vital for performance testing, something too often overlooked (I can’t believe anyone is still using ‘ab’, for instance!). If the client can’t scale, or introduces limitations in other ways, the resulting numbers will be meaningless because they reflect the limits of the client, not those of the server being tested.

I’m using faban for these tests.

I’m running faban on my OpenIndiana machine, which has an AMD Phenom II X4 925 (quad core) CPU and 8GB of RAM.

Server

I’m running the various web servers under test on a small Linux box. It would be fun to test on a server with lots of cores but this is what I have available:

  • CPU: Intel Atom D510 (1.66GHz, 2 cores, 4 hardware threads)
  • RAM: 4GB
  • OS: Debian 6.0.6 32bit

Network

Both machines have gigabit ethernet and are connected to the same gigabit switch.

As an aside, there seems to be a trend of testing servers via localhost (that is, with the load generator on the same machine connecting to http://localhost). Doing so is a mistake that will produce meaningless numbers. Your customers won’t be connecting to the server on localhost, so your load generator shouldn’t either.

Software Configuration

Another important decision for benchmarking is how to tune the software stack. For this round of runs I am using out-of-the-box default configurations for everything. This means the results will not be optimized. I’m sure most, if not all, of the web servers being tested could score higher if their configuration were tuned to the very specific requirements of this particular test and hardware environment. Why test default configurations? A few reasons:

  • Baseline: I expect I’ll run more optimized configurations later on, so it is nice to have a baseline to compare how much future tuning helps.
  • Reality: Truth is, lots of servers do get deployed into production with default configurations. So while the results are not optimal, they are relevant to everyone who has not taken the time to tune their server configuration.
  • Fairness: I am familiar with performance tuning for only a couple of the web servers I’m testing here. If I tune those well and the others badly, I’ll influence the results in favor of the ones I know. So, to be fair, I won’t tune any of them.

Software Versions

Whenever possible, I installed the Debian package for each server.

  • apache2-mpm-event    2.2.16-6+squeeze8
  • apache2-mpm-worker   2.2.16-6+squeeze8
  • lighttpd             1.4.28-2+squeeze1
  • nginx                0.7.67-3+squeeze2
  • cherokee             1.0.8-5+squeeze1
  • monkey               0.9.3-1
  • varnish              2.1.3-8
  • g-wan from http://gwan.com/archives/gwan_linux32-bit.tar.bz2
  • heliod 0.2 from http://sourceforge.net/projects/heliod/files/release-0.2/

Operation

The server has 1000 files; each file is different and each one is 8K in size. The faban driver requests a random file out of the 1000 each time.
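
If you want to recreate a similar data set, here is a minimal Python sketch of how 1000 distinct 8K files could be generated. The docroot path and file names below are placeholders I picked for illustration, not necessarily what my setup uses:

    # generate_files.py: create 1000 distinct 8K files for a static file test.
    import os

    DOCROOT = "/var/www/fabhttp"   # placeholder; point at whatever directory your server serves
    COUNT = 1000                   # 1000 files, as described above
    SIZE = 8 * 1024                # 8K each

    os.makedirs(DOCROOT, exist_ok=True)
    for i in range(COUNT):
        # Repeat a per-file marker so every file has distinct content.
        marker = ("file%04d " % i).encode("ascii")
        data = (marker * (SIZE // len(marker) + 1))[:SIZE]
        with open(os.path.join(DOCROOT, "file%04d.html" % i), "wb") as f:
            f.write(data)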

There is no client think time between requests. That is, each client thread will load up the server as fast as the server can respond.

For each server I am doing ten runs, starting with 1 concurrent client (sequential requests) and going up to 10 concurrent clients. Keep in mind the server only has one CPU with two cores and four hardware threads, so one should expect throughput to scale up from 1 to 4 concurrent clients and start to taper off after that.

Each run is 30 minutes long. This allows a bit of time to see if throughput remains consistent for a given load level. Each run is preceded by a one minute warmup time.
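
To make the request pattern concrete, here is a minimal Python sketch of the same closed-loop idea: each client thread picks a random file out of the 1000, fetches it, and immediately repeats with no think time. The actual load is generated by the faban driver I will publish, not by this script, and the server address and file names below are placeholders:

    # loadloop.py: illustration of the closed-loop request pattern described above.
    import random
    import threading
    import time
    import urllib.request

    BASE_URL = "http://192.168.1.10"   # placeholder server address
    CLIENTS = 4                        # concurrency level for this run
    DURATION = 60                      # seconds (the real runs are 30 minutes)

    counts = [0] * CLIENTS             # requests completed per client thread

    def client(idx):
        deadline = time.time() + DURATION
        while time.time() < deadline:
            name = "file%04d.html" % random.randrange(1000)
            with urllib.request.urlopen("%s/%s" % (BASE_URL, name)) as resp:
                resp.read()            # drain the 8K body
            counts[idx] += 1           # no think time: loop again immediately

    threads = [threading.Thread(target=client, args=(i,)) for i in range(CLIENTS)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("total requests:", sum(counts), "avg req/s:", sum(counts) / DURATION)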