Duplicate file detection with dupd

I love my zfs file server… but as always with such things, storage brings an accumulation of duplicates. During a cleaning binge earlier this year I wrote a little tool to identify these duplicates conveniently. For months now I’d been meaning to clean up the code a bit and throw together some documentation so I could publish it. Well, finally got around to it and dupd is now up on github.

Before writing dupd I tried a few similar tools that I found on a quick search but they either crashed or were unspeakably slow on my server (which has close to 1TB of data).

Later I found some better tools like fdupes but by then I’d mostly completed dupd so decided to finish it. Always more fun to use one’s own tools!

I’m always interested in performance so can’t resist the opportunity to do some speed comparisons. I also tested fastdup.

Nice to see that dupd is the fastest of the three on these (fairly small) data sets (I did not benchmark my full file server because even with dupd it takes nearly six hours for a single run).

There is no result for fastdup on the Debian /usr scan because it hangs and never produces a result (unfortunately fastdup is not very robust; it appears to hang on symlinks… so while it is fast when it works, it is not yet practical for real use).

The times displayed on the graph were computed as follows: I ran the command once to warm up the cache and then ran it ten times in a row. I discarded the two fastest and two slowest runs and averaged the remaining six runs.
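For reference, here is what that trimmed average boils down to (a minimal plain-Java sketch; the run times below are made-up placeholders, not measurements):

import java.util.Arrays;

public class TrimmedAverage {
    public static void main(String[] args) {
        // Hypothetical wall-clock times (in seconds) for the ten timed runs.
        double[] runs = { 41.2, 40.8, 40.9, 41.5, 40.7, 41.1, 41.0, 42.3, 40.6, 41.3 };
        Arrays.sort(runs);

        // Discard the two fastest and two slowest runs, average the remaining six.
        double sum = 0;
        for (int i = 2; i < runs.length - 2; i++) {
            sum += runs[i];
        }
        System.out.printf("trimmed average: %.2f s%n", sum / (runs.length - 4));
    }
}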

 

Web Server Efficiency

In my previous article I covered the benchmark results from static file testing of various web servers. One interesting observation was how much difference there was in CPU consumption even between servers delivering roughly comparable results. For example, nginx, apache-worker and cherokee delivered similar throughput with 10 concurrent clients but apache-worker just about saturated the CPU while doing so, unlike the other two.

I figured it would be interesting to look at the efficiency of each of these servers by computing throughput per percentage of CPU capacity consumed. Here is the resulting graph:
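(For reference, the index behind the graph is just throughput divided by the percentage of CPU actually consumed, with consumption taken as 100 minus the idle percentage from the dstat data. A minimal sketch with placeholder numbers, not the measured results:)

public class EfficiencyIndex {
    public static void main(String[] args) {
        double requestsPerSecond = 5000.0;   // placeholder throughput
        double cpuIdlePercent = 72.0;        // placeholder idle percentage from dstat
        double cpuUsedPercent = 100.0 - cpuIdlePercent;
        System.out.printf("%.0f req/s per %% of CPU consumed%n",
                          requestsPerSecond / cpuUsedPercent);
    }
}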

In terms of raw throughput apache-worker came in third place but here it does not do well at all because, as mentioned, it maxed out the CPU to deliver its numbers. Cherokee, previously fourth, also drops down in ranking when considering efficiency since it also used a fair amount of CPU.

The biggest surprise here is varnish, which performed very well (second place) in raw throughput. While it was almost able to match heliod, it consumed quite a bit more CPU capacity to do so, which results in the relatively low efficiency numbers seen here.

Lighttpd and nginx do well here in terms of efficiency – while their absolute throughput wasn’t as high, they also did not consume much CPU. (Keep in mind these baseline runs were done with a default configuration, so nginx was only running one worker process.)

I’m pleasantly surprised that heliod came out on top once again. Not only did it sustain the highest throughput, it turns out it also did so more efficiently than any of the other web servers! Nice!

Now, does this CPU efficiency index really matter at all in real usage? Depends…

If you have dedicated web server hardware then not so much. If all the CPU is doing is running the web server then might as well fully utilize it for that. Although there should still be some benefit from a more efficient server in terms of lower power consumption and lower heat output.

However, if you’re running on virtual instances (whether your own or on a cloud provider) where the physical CPUs are shared then there are clear benefits to efficiency. Either to reduce CPU consumption charges or just to free up more CPU cycles to the other instances running on the same hardware.

Or… you could just use heliod, in which case you don’t need to choose between throughput and efficiency, given that heliod produced both the highest throughput (in this benchmark scenario, anyway) and the highest efficiency ranking.

 

The fastest web server is …

A few days ago I mentioned that I had started doing some static file performance runs on a handful of web servers. Here are the results!

Please refer to the previous article for details on exactly what and how I’m testing this time. Benchmark results apply to the narrow use case being tested, and this one is no different.

The results are ordered from slowest to fastest peak numbers produced by each server.

9. Monkey (0.9.3-1 debian package)

The monkey server was not able to complete the runs so it is not included in the graphs. At a concurrency of 1, it finished the 30 minute run with an average of only 133 requests/second, far slower than any of the others. With only two concurrent clients it started erroring out on some requests so I stopped testing it. Looks like this one is not ready for prime time yet.

8. G-WAN (3.3.28)

I had seen some of the performance claims around G-WAN so decided to try it, even though it is not open source. All the tests I’d seen of it had been done running on localhost so I was curious to see how it behaves under a slightly more realistic type of load. Turns out, not too well. Aside from monkey, it was the slowest of the group.

7. Apache HTTPD, Event MPM (2.2.16-6+squeeze8 debian package)

I was surprised to see the event MPM do so badly. To be fair, it should do better against a benchmark which has large numbers of mostly idle clients which is not what this particular benchmark tested. At most points in the test it was also the highest consumer of CPU.

6. lighttpd (1.4.28-2+squeeze1 debian package)

Here we get to the first of the serious players (for this test scenario). lighttpd starts out strong up to 3 concurrent clients. After that it stops scaling up so it loses some ground in the final results. Also, lighttpd is the lightest user of CPU of the group.

5. nginx (0.7.67-3+squeeze2 debian package)

The nginx throughput curve is just about identical to lighttpd, just shifted slightly higher. The CPU consumption curve is also almost identical. These two are twins separated at birth. While nginx uses a tiny bit more CPU than lighttpd, it makes up for it with higher throughput.

4. cherokee (1.0.8-5+squeeze1 debian package)

Cherokee just barely edges out nginx at the higher concurrencies tested so it ends up fourth. To be fair, nginx was faster than cherokee at most of the lower concurrencies though. Note, however, that cherokee uses quite a bit more CPU to deliver its numbers so it is not as efficient as nginx.

3. Apache HTTPD, Worker MPM (2.2.16-6+squeeze8 debian package)

Apache third, really? Yes, but only because this ranking is based on the peak numbers of each server. With the worker MPM, apache starts out quite a bit behind lighttpd/nginx/cherokee at lower client concurrencies. However, as those others start to stall as concurrency increases, apache keeps going higher. Around five concurrent clients it catches up to lighttpd and around eight clients it catches up to nginx and cherokee. At ten it scores a throughput just slightly above those two, securing third place in this test. Looking at CPU usage though, at that point it has just about maxed out the CPU (about 1% idle), making it the highest CPU consumer of this group, so it is not very efficient.

2. varnish (2.1.3-8 debian package)

Varnish is not really a web server, of course, so in that sense it is out of place in this test. But it can serve (cached) static files and has been included in other similar performance tests so I decided to include it here.

Varnish throughput starts out quite a bit slower than nginx, right on par with lighttpd and cherokee at lower concurrencies. However, varnish scales up beautifully. Unlike all the previous servers, its throughput curve does not flatten out as concurrency increases in this test; it keeps going higher. Around four concurrent users it surpasses nginx and keeps climbing all the way to ten.

Varnish was able to push network utilization to 90-94%. The only drawback is that delivering its performance does use up a lot of CPU… only Apache used more CPU than varnish in this test. At ten clients, there is only 9% idle CPU left.

1. heliod (0.2)

heliod had the highest throughput at every point tested in these runs. It is slightly faster than nginx at sequential requests (one client) and then pulls away.

heliod is also quite efficient in CPU consumption. Up to four concurrent clients it is the lightest user of CPU cycles even though it produced higher throughput than all the others. At higher concurrencies, it used slightly more CPU than nginx/lighttpd although it makes up for it with far higher throughput.

heliod was also the only server able to saturate the gigabit connection (at over 97% utilization). Given that there is 62% idle CPU left at that point, I suspect if I had more bandwidth heliod might be able to score even higher on this machine.

These results should not be much of a surprise… after all heliod is not new, it is the same code that has been setting benchmark records for over ten years (it just wasn’t open source back then). Fast then, still fast today.

If you are running one of these web servers and using varnish to accelerate it, you could switch to heliod by itself and both simplify your setup and gain performance at the same time. Food for thought!


All right, let’s see some graphs…

First, here is the overall throughput graph for all the servers tested:

As you can see the servers fall into three groups in terms of throughput:

  1. apache-event and g-wan are not competitive in this crowd
  2. apache-worker/nginx/lighttpd/cherokee are quite similar in the middle
  3. varnish and heliod are in a class of their own at the high end

The next graph shows the 90th percentile response time for each server. That is, 90 percent of all requests completed in this time or less. I left out apache-event and g-wan to avoid compressing the more interesting part of the graph:
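(As an aside, a 90th percentile over a set of response times is easy to compute with the nearest-rank approach; here is a minimal sketch with made-up numbers, not the values plotted:)

import java.util.Arrays;

public class Percentile {
    // Returns the smallest sample such that the given fraction of samples are <= it.
    static double percentile(double[] samples, double fraction) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(fraction * sorted.length) - 1;
        return sorted[Math.max(index, 0)];
    }

    public static void main(String[] args) {
        // Hypothetical response times in milliseconds.
        double[] times = { 1.2, 1.4, 1.1, 2.0, 1.3, 5.6, 1.2, 1.5, 1.8, 1.3 };
        System.out.printf("90th percentile: %.1f ms%n", percentile(times, 0.90));
    }
}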

The next graph shows CPU idle time (percent) for each server through the run. The spikes to 100% between each step are due to the short idle interval between each run as faban starts the next run.

The two apache variants (red and orange) are the only ones that maxed out the CPU. Varnish (light green) also uses quite a bit of CPU and comes close (9% idle). At the other end, lighttpd (dark red) and nginx (light blue) put the least load on the CPU, with about 72% idle.

Finally, the next graph shows network utilization percentage of the gigabit interface:

Here heliod (blue) is the only one which manages to saturate the network, with varnish coming in quite close. None of the others manage to reach even 60% utilization.

So there you have it… heliod can sustain far higher throughput than any of the popular web servers in this static file test and it can do so efficiently, saturating the network on a low power two core machine while leaving plenty of CPU idle. It even manages to sustain higher throughput than varnish which specializes in caching static content efficiently and is not a full featured web server.

Of course, all benchmarks are by necessity artificial. If any of the variables change the numbers will change and the rankings may change. These results are representative of the exact use case and setup I tested, not necessarily of any other. Again, for details on what and how I tested, see my previous article.

I hope to test other scenarios in the future. I’d love to also test on a faster CPU with lots of cores, unfortunately I don’t own such hardware so it is unlikely to happen.

Finally, I set up a github repository fabhttp which contains:

  1. source code of the faban driver used to run these tests
  2. dstat/nicstat data collected during the runs (used to generate the graphs above)
  3. additional graphs generated by faban for every individual run

 

Web Server Performance Testing

I started to run some performance tests on heliod to compare it with a handful of other web servers out there. I’ll publish results in the next article once the runs complete.

Now, heliod shines brightest on large systems with multiple CPUs and many cores. Unfortunately I don’t own such hardware, so I’m testing on a very low-end system I have available, but it should still be interesting.

One of the many challenges of benchmarking is deciding what to test. All performance tests are to some extent artificial; ultimately all that matters is how the server works in production with the actual customer traffic over the production network.

For these runs I chose to measure static file performance with files of size 8K.

One of my pet peeves is articles which show someone’s performance run results without any details as to how they were measured. So to avoid doing that myself, below are all the details on how I’m running these tests. In addition, I will publish the driver source and data set so you can run the same tests on your own machine if you like.

Client

A trustworthy load generator client is vital for performance testing, something too often overlooked (I can’t believe anyone is still using ‘ab’, for instance!). If the client can’t scale, or in other ways introduces limitations, the resulting numbers will be meaningless because they reflect the limits of the client, not those of the server being tested.

I’m using faban for these tests.

I’m running faban on my OpenIndiana machine which has an AMD Phenom II X4 (quad core) 925 CPU and 8GB RAM.

Server

I’m running the various web servers under test on a small Linux box. It would be fun to test on a server with lots of cores but this is what I have available:

  • CPU: Intel Atom D510 (1.66GHz, 2 cores, 4 hardware threads)
  • RAM: 4GB
  • OS: Debian 6.0.6 32bit

Network

Both machines have gigabit ethernet and are connected to the same gigabit switch.

As an aside, there seems to be a trend of testing servers via localhost (that is, with the load generator on the same machine connecting to http://localhost). Doing so is a mistake that produces meaningless numbers. Your customers won’t be connecting to the server over localhost, so your load generator shouldn’t either.

Software Configuration

Another important decision for benchmarking is how to tune the software stack. For this round of runs I am running out-of-the-box default configurations for everything. This means the results will not be optimized. I’m sure most, if not all, the web servers being tested could score higher if their configuration is tuned to the very specific requirements of this particular test and hardware environment. Why test default configurations? A few reasons:

  • Baseline: I expect I’ll run more optimized configurations later on, so it is nice to have a baseline to compare how much future tuning helps.
  • Reality: Truth is, lots of servers do get deployed into production with default configurations. So while the results are not optimal, they are relevant to everyone that has not taken the time to tune their server configuration.
  • Fairness: I am familiar with performance tuning only a couple of the web servers I’m testing here. If I tune those well and the other ones badly I’ll influence the results in favor of the ones I know. So to be fair, I won’t tune any of them.

Software Versions

Whenever possible, I installed the Debian package for each server.

  • apache2-mpm-event    2.2.16-6+squeeze8
  • apache2-mpm-worker   2.2.16-6+squeeze8
  • lighttpd             1.4.28-2+squeeze1
  • nginx                0.7.67-3+squeeze2
  • cherokee             1.0.8-5+squeeze1
  • monkey               0.9.3-1
  • varnish              2.1.3-8
  • g-wan from http://gwan.com/archives/gwan_linux32-bit.tar.bz2
  • heliod 0.2 from http://sourceforge.net/projects/heliod/files/release-0.2/

Operation

The server has 1000 files, each file is different and each one is 8K in size. The faban driver requests a random file out of the 1000 each time.

There is no client think time between requests. That is, each client thread will load up the server as fast as the server can respond.
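To make the workload concrete, here is a rough plain-Java illustration of what each client thread does (this is not the actual faban driver, whose source I’ll publish; the host name and file naming scheme here are made up for the example):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Random;

public class RequestLoop {
    public static void main(String[] args) throws Exception {
        Random random = new Random();
        byte[] buffer = new byte[8192];

        // One client thread: fetch a random file out of the 1000, with no think time.
        // (The real runs are time-bounded at 30 minutes rather than request-counted.)
        for (int i = 0; i < 100_000; i++) {
            int n = random.nextInt(1000);
            URL url = new URL("http://server.example.com/files/file" + n + ".html");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (InputStream in = conn.getInputStream()) {
                while (in.read(buffer) != -1) {
                    // Discard the body; only throughput matters here.
                }
            }
            conn.disconnect();
        }
    }
}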

For each server I am doing ten runs, starting with 1 concurrent client (sequential requests) up to 10 concurrent clients. Keep in mind the server only has 1 CPU with 4 hardware threads in two cores, so one should expect the throughput to scale up from 1 to 4 concurrent clients and start to taper off after that.

Each run is 30 minutes long. This allows a bit of time to see if throughput remains consistent for a given load level. Each run is preceded by a one minute warmup time.

heliod and CRIME

CRIME is an interesting approach to leaking information protected by SSL/TLS. It is an easy-to-understand example of why security issues are nearly always more complex than they seem!

heliod uses NSS, which by default has TLS compression disabled so it is not vulnerable.

Web Server

If you were to look at the HTTP response headers from this site, you’d see it is being handled by:

Server: heliod-web-server/0.1

Which is a web server you’ve probably never heard of before… Or I should say, you most likely have, but under various different names.

Way back when, this was the Netscape Enterprise Server. Which later became iPlanet Web Server (during the Sun|Netscape alliance). Under Sun alone, it was renamed several times to SunONE Web Server and Sun Java System Web Server (and maybe some other name variants I forget now). Naming nonsense aside, it’s been the same evolving code base all along, best known for high performance and even higher scalability.

Thankfully, Sun open sourced the code in 2009 under the BSD license. Most of it, anyway. Unfortunately a few parts were left out, mainly the administration support, the installer and the embedded Java servlet engine. The open source code was kept in sync with the commercial releases until January 2010 (7.0 update 8, in the commercial release version numbering). Since then the open source repository has not seen any activity (not coincidentally, January 2010 was also when Oracle acquired Sun).

Surprisingly, the source repository is still available:

hg clone ssh://anon@hg.opensolaris.org/hg/webstack/webserver

The source as published can be tricky to build and it does not produce an installable package. When I was setting up this site last year I ended up forking this code into http://sourceforge.net/projects/heliod/. The code is the same but I added a rudimentary install script to make it easier to get going. You can download binaries for Solaris (x86) and Linux from the sourceforge page so you don’t have to build it yourself if you prefer not to.

(Update: The source is now in github here: https://github.com/jvirkki/heliod)

 

Joyent Debacle

I’ve been hosting this server on Joyent for a while now, for a few reasons. One was that their VMs are Solaris zones: (a) that is cool, (b) I prefer hosting Internet-facing servers on Solaris, and (c) some of the Sun talent moved to Joyent after the Oracle disaster, so I liked the idea of supporting Joyent. The other reason was that Joyent offered a fixed-price-for-life server when I signed up, so it was a nice deal as well.

Yesterday Joyent broke their promises by dropping all the fixed & lifetime plans out of the blue. There’s been coverage on Slashdot, Network World, ZDNet and plenty other places. Discussion rages on at the support forum and there is a google group dedicated to finding alternatives as well.

The people who prepaid for a lifetime plan were the hardest hit. For me it is not that bad as I was on a monthly plan (not prepaid), but they did still break the promise of maintaining a fixed price for life.

I could migrate to their new plans, which are about 50% more expensive per month but it seems hard to justify why I should trust these people anymore. So I’ll probably migrate this server elsewhere once I do some research to find something more trustworthy than Joyent. The sad part is it probably won’t be Solaris ;-(

 

Bloom filter vs. CPU cache

I was playing around with a bloom filter today and drawing some graphs on performance vs. various metrics. While for my real use case the filter bit array is larger, just for fun I wanted to look at how the performance changes as the array size exceeds various CPU caches.

I’m running on an AMD Phenom II X4 925 Processor which has 64K L1 cache (per core), 512K L2 cache (per core) and 6MB L3 cache (shared). The bloom filter code is single threaded.
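Conceptually, the timing loop boils down to something like the sketch below (a simplified stand-in with a made-up hashing scheme, not the actual filter code I measured). Varying the bits constant is what moves the array across the L1/L2/L3 cache boundaries discussed next.

import java.util.BitSet;
import java.util.Random;

public class BloomInsertTiming {
    public static void main(String[] args) {
        final int bits = 1 << 22;     // bit array size: 4M bits = 512KB
        final int hashes = 4;         // bit positions set per element
        BitSet filter = new BitSet(bits);
        Random random = new Random(42);

        long start = System.nanoTime();
        for (int i = 0; i < 10_000_000; i++) {
            long element = random.nextLong();
            // Derive several bit positions from the element (simplified double hashing).
            int h1 = (int) element;
            int h2 = (int) (element >>> 32) | 1;   // force odd so the positions spread
            for (int k = 0; k < hashes; k++) {
                int index = ((h1 + k * h2) & 0x7fffffff) % bits;
                filter.set(index);
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("inserted 10 million elements in %.2f s%n", seconds);
    }
}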

The following graph shows the time (yellow line) taken to insert 10 million entries into the bloom filter as the size of the bit array (red line) increases linearly (the two lines are not on the same y-axis scale). For the first two-thirds of the graph the time taken is just over 2 seconds, or just under 5 million elements per second. The sudden change in the slope of the line is near 512K, at which point the array no longer fits in the L2 cache.

The next graph zooms out to show the size of the bit array increasing all the way to 18MB. The second slope change (about a third of the way from the left) is in the neighborhood of 6MB, corresponding to the L3 cache size.

 

Bandwidth throttling with faban

I often use faban for performance related work. Nearly always I have used it while working on APIs which are called by other servers (as opposed to humans, who linger between mouse clicks) and where bandwidth use is not a significant factor (the processing time of the request outweighs the request/response time by orders of magnitude). For these requirements it has always worked well to run the faban tests with zero think time and letting it issue requests as fast as the server can handle.

Recently, however, I’ve been looking into a system where the request and/or response bodies are quite large, so the bulk of the total request time is consumed by the data transmission over the network. This creates a bit of a problem because in the lab the faban machine and the server being tested (“SUT”) are wired together via gigabit ethernet so there is a decent amount of bandwidth between them. While that sounds like a good problem to have, the reality is that in production the end users are coming in over the internet and have far lower bandwidth.

Thus, the testing is not very realistic. Faban can saturate the server with just a few users uploading at gigabit speeds, even though I know the server can handle far more users when each one is uploading at much slower speeds over the internet.

Turns out faban has the capability to throttle the upload and/or download bandwidth over a given socket. As far as I could find this is not documented anywhere; I found it by accident while looking at the code when I was considering various solutions.

Here’s one way (there may be other ways) to use it:

// Inside a driver operation (assumes the usual import of
// com.sun.faban.driver.DriverContext in the driver class).
DriverContext ctx = DriverContext.getContext();

// The throttling methods are on the engine-level DriverContext, hence the cast.
com.sun.faban.driver.engine.DriverContext engine =
    (com.sun.faban.driver.engine.DriverContext) ctx;

// Set desired speed in KB per second, or -1 to disable throttling.
engine.setUploadSpeed(uploadKBps);
engine.setDownloadSpeed(downloadKBps);

As of this writing the latest faban version is 1.0.2. In this version the upload throttling works fine but downloads (i.e. reading the response body) can hang if throttling is enabled. I filed a bug with a fix that is working reliably for me. If you try this with 1.0.2 (or earlier, probably) then you’ll need to apply that change and rebuild faban.

 

What is your cache hit rate?

While this may sound like an obvious metric to check, I often see that developers don’t verify the cache hit rate of their code under realistic conditions. The end result is a server which performs worse than if it had no cache at all.

We all know the benefits of keeping a local cache… it is relatively cheap to keep and saves having to make more expensive calls to obtain the data from wherever it ultimately resides. Just don’t forget that keeping that cache, while cheap, takes non-zero CPU and memory resources. The code must get more benefit from the cache than the cost of maintaining it, otherwise it is a net loss.

I was recently reviewing a RESTful service which kept a cache of everything it processed. The origin retrieval was relatively expensive so this seemed like a good idea. However, given the size of the objects being processed vs. the amount of RAM allocated to this service in production, my first question was what’s the cache hit rate?

Developers didn’t know, but felt that as long as it saves any back-end hit it must help, right?

A good rule of thumb is that anything that isn’t being measured is probably misbehaving… and this turned out to be no exception.

Testing under some load (I like faban for this) showed the server was allocating and quickly garbage collecting tens of thousands of buffer objects per second for the cache. Hit rate you ask? Zero!

Merely commenting out the cache gave a quick 10% boost in overall throughput.

So that’s my performance tip of the day… be aware of your cache hit rates!
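A minimal sketch of the kind of instrumentation that makes the hit rate visible (a generic wrapper for illustration, not the code from the service in question):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

public class CountingCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();

    // Look up a value, falling back to the (expensive) loader on a miss.
    public V get(K key, Function<K, V> loader) {
        V value = cache.get(key);
        if (value != null) {
            hits.incrementAndGet();
            return value;
        }
        misses.incrementAndGet();
        value = loader.apply(key);
        cache.put(key, value);
        return value;
    }

    // Expose the hit rate so it can be logged or graphed while testing under load.
    public double hitRate() {
        long h = hits.get();
        long m = misses.get();
        return (h + m) == 0 ? 0.0 : (double) h / (h + m);
    }
}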