Duplicate file detection with dupd

I love my ZFS file server… but as always, plentiful storage brings an accumulation of duplicates. During a cleaning binge earlier this year I wrote a little tool to identify these duplicates conveniently. For months I’d been meaning to clean up the code a bit and put together some documentation so I could publish it. I finally got around to it, and dupd is now up on github.

Before writing dupd I tried a few similar tools that I found in a quick search, but they either crashed or were unspeakably slow on my server (which has close to 1TB of data).

Later I found some better tools like fdupes but by then I’d mostly completed dupd so decided to finish it. Always more fun to use one’s own tools!

I’m always interested in performance, so I can’t resist the opportunity to do some speed comparisons. In addition to dupd and fdupes, I also tested fastdup.

Nice to see that dupd is the fastest of the three on these (fairly small) data sets (I did not benchmark my full file server because even with dupd it takes nearly six hours for a single run).

There is no result for fastdup on the Debian /usr scan because it hangs before producing any output. Unfortunately fastdup is not very robust; it appears to hang on symlinks. So while it is fast when it works, it is not yet practical for real use.

The times displayed on the graph were computed as follows: I ran each command once to warm up the cache and then ran it ten times in a row. I discarded the two fastest and two slowest runs and averaged the remaining six.


Web Server Efficiency

In my previous article I covered the benchmark results from static file testing of various web servers. One interesting observation was how much difference there was in CPU consumption even between servers delivering roughly comparable results. For example, nginx, apache-worker and cherokee delivered similar throughput with 10 concurrent clients but apache-worker just about saturated the CPU while doing so, unlike the other two.

I figured it would be interesting to look at the efficiency of each of these servers by computing throughput per percentage of CPU capacity consumed. Here is the resulting graph:
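The efficiency index itself is trivial to compute: throughput divided by the percentage of CPU capacity consumed. A small sketch, with made-up numbers rather than the actual benchmark results:

```python
def efficiency(throughput_rps, cpu_percent):
    """Requests per second delivered per percent of CPU consumed."""
    return throughput_rps / cpu_percent

# Hypothetical illustration: a server doing 8000 req/s at 40% CPU
# scores 200, while one doing 9000 req/s at 90% CPU scores only 100 --
# higher raw throughput, but much lower efficiency.
```

This is why a server can rank well on raw throughput yet poorly here: saturating the CPU to reach its numbers drags the index down.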

In terms of raw throughput apache-worker came in third place but here it does not do well at all because, as mentioned, it maxed out the CPU to deliver its numbers. Cherokee, previously fourth, also drops down in ranking when considering efficiency since it also used a fair amount of CPU.

The biggest surprise here is varnish, which performed very well (second place) in raw throughput. While it was almost able to match heliod, it consumed quite a bit more CPU capacity to do so, which results in the relatively low efficiency numbers seen here.

Lighttpd and nginx do well here in terms of efficiency – while their absolute throughput wasn’t as high, they also did not consume much CPU. (Keep in mind these baseline runs were done with a default configuration, so nginx was only running one worker process.)

I’m pleasantly surprised that heliod came out on top once again. Not only did it sustain the highest throughput, it turns out it also did so more efficiently than any of the other web servers! Nice!

Now, does this CPU efficiency index really matter at all in real usage? Depends…

If you have dedicated web server hardware, then not so much. If all the CPU is doing is running the web server, you might as well fully utilize it for that. That said, a more efficient server should still offer some benefit in lower power consumption and lower heat output.

However, if you’re running on virtual instances (whether your own or at a cloud provider) where the physical CPUs are shared, there are clear benefits to efficiency: either reducing CPU consumption charges or simply freeing up more CPU cycles for the other instances running on the same hardware.

Or… you could just use heliod, in which case you don’t need to choose between throughput and efficiency, given that heliod produced both the highest throughput (in this benchmark scenario, anyway) and the highest efficiency ranking.