I started to run some performance tests on heliod to compare it with a handful of other web servers out there. I’ll publish results in the next article once the runs complete.
Now, heliod shines brightest on large systems with multiple many-core CPUs. Unfortunately I don’t own such hardware, so I’m testing on a very low-end system I have available, but it should still be interesting.
One of the many challenges of benchmarking is deciding what to test. All performance tests are to some extent artificial; ultimately, all that matters is how the server performs in production, with actual customer traffic over the production network.
For these runs I chose to measure static file performance with files of size 8K.
One of my pet peeves is articles that show someone’s performance run results without any details as to how they were measured. So to avoid doing that myself, below are all the details on how I’m running these tests. In addition, I will publish the driver source and data set so you can run the same tests on your machine if you like.
A trustworthy load generator client is vital for performance testing, something too often overlooked (I can’t believe anyone is still using ‘ab’, for instance!). If the client can’t scale, or introduces limitations in other ways, the resulting numbers will be meaningless because they reflect the limits of the client, not those of the server being tested.
I’m using faban for these tests.
I’m running faban on my Open Indiana machine which has an AMD Phenom II X4 (quad core) 925 CPU and 8GB RAM.
I’m running the various web servers under test on a small Linux box. It would be fun to test on a server with lots of cores but this is what I have available:
Both machines have gigabit ethernet and are connected to the same gigabit switch.
As an aside, there seems to be a trend of testing servers via localhost (that is, with the load generator on the same machine connecting to http://localhost). Doing so is a mistake that will report meaningless numbers. Your customers won’t be connecting to the server on localhost, so your load generator shouldn’t either.
Another important decision for benchmarking is how to tune the software stack. For this round of runs I am running out-of-the-box default configurations for everything. This means the results will not be optimized. I’m sure most, if not all, of the web servers being tested could score higher if their configuration were tuned to the specific requirements of this particular test and hardware environment. Why test default configurations? A few reasons:
- Baseline: I expect I’ll run more optimized configurations later on, so it is nice to have a baseline to compare how much future tuning helps.
- Reality: Truth is, lots of servers do get deployed into production with default configurations. So while the results are not optimal, they are relevant to everyone who has not taken the time to tune their server configuration.
- Fairness: I am familiar with performance tuning for only a couple of the web servers I’m testing here. If I tune those well and the other ones badly, I’ll skew the results in favor of the ones I know. So to be fair, I won’t tune any of them.
Whenever possible, I installed the Debian package for each server.
- apache2-mpm-event 2.2.16-6+squeeze8
- apache2-mpm-worker 2.2.16-6+squeeze8
- lighttpd 1.4.28-2+squeeze1
- nginx 0.7.67-3+squeeze2
- cherokee 1.0.8-5+squeeze1
- monkey 0.9.3-1
- varnish 2.1.3-8
- g-wan from http://gwan.com/archives/gwan_linux32-bit.tar.bz2
- heliod 0.2 from http://sourceforge.net/projects/heliod/files/release-0.2/
The server has 1000 files. Each file is different and each one is 8K in size. The faban driver requests a random file out of the 1000 each time.
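For anyone who wants to build a comparable data set before the real one is published, here is a minimal Python sketch that writes 1000 distinct 8K files. The file naming scheme (`file0000.bin` and so on) and the use of random filler bytes are my assumptions for illustration, not a description of the actual data set.

```python
import os

FILE_COUNT = 1000      # number of files, from the test description
FILE_SIZE = 8 * 1024   # 8K per file, from the test description


def generate_data_set(root, count=FILE_COUNT, size=FILE_SIZE):
    """Write `count` distinct files of exactly `size` bytes under `root`.

    The file names (file0000.bin, file0001.bin, ...) are a hypothetical
    naming scheme chosen for this sketch.
    """
    os.makedirs(root, exist_ok=True)
    paths = []
    for i in range(count):
        # A unique header guarantees that no two files are identical;
        # the rest is random filler that pads the file to the exact size.
        header = f"{i:04d}\n".encode("ascii")
        body = header + os.urandom(size - len(header))
        path = os.path.join(root, f"file{i:04d}.bin")
        with open(path, "wb") as f:
            f.write(body)
        paths.append(path)
    return paths
```

Calling `generate_data_set("/path/to/docroot")` fills the given directory, which can then be used as the document root of whichever server is under test.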
There is no client think time between requests. That is, each client thread will load up the server as fast as the server can respond.
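To make the "no think time" behavior concrete, here is a rough Python sketch of what a single client thread does: pick a random file, request it, and immediately go again. This is not the faban driver and would make a poor load generator itself; the `fetch` callback, the `run_client` name and the `file0000.bin` URL scheme are all hypothetical, chosen only to illustrate the loop.

```python
import random
import time


def run_client(fetch, base_url, file_count=1000, duration_s=1.0, rng=None):
    """Closed-loop client: issue requests back to back until time runs out.

    `fetch` is any callable that takes a URL and blocks until the response
    has been fully received. Returns the number of completed requests.
    """
    rng = rng or random.Random()
    completed = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        index = rng.randrange(file_count)           # random file each time
        fetch(f"{base_url}/file{index:04d}.bin")    # hypothetical URL scheme
        completed += 1                              # no sleep: zero think time
    return completed
```

Because each thread waits for one response before sending the next request, the number of requests in flight never exceeds the number of client threads, which is why the concurrency level maps directly onto load on the server.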
For each server I am doing ten runs, starting with 1 concurrent client (sequential requests) up to 10 concurrent clients. Keep in mind the server only has 1 CPU with 4 hardware threads in two cores, so one should expect the throughput to scale up from 1 to 4 concurrent clients and start to taper off after that.
Each run is 30 minutes long. This allows a bit of time to see if throughput remains consistent for a given load level. Each run is preceded by a one-minute warmup.
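For a sense of how long the whole exercise takes, the arithmetic can be sketched in a few lines of Python. It assumes the runs are executed strictly back to back with no gaps between them, which is my simplification; the server list is the one given above.

```python
SERVERS = [
    "apache2-mpm-event", "apache2-mpm-worker", "lighttpd", "nginx",
    "cherokee", "monkey", "varnish", "g-wan", "heliod",
]
CLIENT_LEVELS = range(1, 11)   # 1 to 10 concurrent clients
WARMUP_MIN = 1                 # warmup before each run
STEADY_MIN = 30                # measured run length


def minutes_per_server(levels=CLIENT_LEVELS, warmup=WARMUP_MIN, steady=STEADY_MIN):
    """Total wall-clock minutes to test one server at every load level."""
    return len(levels) * (warmup + steady)


def total_minutes(servers=SERVERS):
    """Total wall-clock minutes for all servers, assuming no gaps between runs."""
    return len(servers) * minutes_per_server()


print(minutes_per_server())    # 310 minutes per server
print(total_minutes() / 60)    # 46.5 hours for all nine servers
```

So each server takes a little over five hours, and the full set close to two days of continuous running.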