dupd vs. jdupes

Posted on 2016/02/10 by jyri

Tonight I ran across jdupes, which I had not seen before. It is a fork of the venerable fdupes with quite a few performance improvements. Performance?! Well, I had to try it, of course. Here are a few runs of jdupes and dupd on my home directory for comparison (using -A to skip hidden files, which is the default in dupd):

% repeat 5 time ./jdupes -r $HOME -A  > out
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.06s user 10.96s system 99% cpu 14.156 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.07s user 10.82s system 99% cpu 14.029 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.00s user 11.01s system 99% cpu 14.156 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.22s user 11.03s system 98% cpu 14.414 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.04s user 10.87s system 99% cpu 14.042 total

So, consistently about 14 seconds.

% repeat 5 time dupd scan -p $HOME -q
dupd scan -p $HOME -q  3.99s user 6.76s system 139% cpu 7.691 total
dupd scan -p $HOME -q  4.16s user 6.28s system 140% cpu 7.416 total
dupd scan -p $HOME -q  4.13s user 6.53s system 141% cpu 7.540 total
dupd scan -p $HOME -q  3.98s user 6.39s system 139% cpu 7.405 total
dupd scan -p $HOME -q  4.00s user 6.44s system 140% cpu 7.404 total

About 7.5 seconds, or just under that, for dupd.
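To put a number on the difference, here is a small sketch that averages the wall-clock ("total") times from the two sets of runs above and takes their ratio. The variable names are only for illustration; the values are copied from the output shown.

```python
from statistics import mean

# Wall-clock ("total") seconds copied from the five runs of each tool above.
jdupes_totals = [14.156, 14.029, 14.156, 14.414, 14.042]
dupd_totals = [7.691, 7.416, 7.540, 7.405, 7.404]

jdupes_mean = mean(jdupes_totals)
dupd_mean = mean(dupd_totals)

# How many times longer jdupes took than dupd on this file set.
ratio = jdupes_mean / dupd_mean

print(f"jdupes {jdupes_mean:.2f}s, dupd {dupd_mean:.2f}s, ratio {ratio:.2f}x")
```

On this one machine and file set that works out to a bit under a factor of two in wall-clock time.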

I still have a handful of ideas to make dupd faster; as I find some spare time I’ll try them out.

dupd 1.3 released

Posted on 2015/12/31 by jyri

I intended 1.2 to be my “year-end” release but ended up cleaning up some things in the few days since, including one bug, so might as well call this 1.3 now.

A few pre-built binaries are again available in case that’s helpful.

dupd release 1.3 link on github.

dupd 1.2 released

Posted on 2015/12/24 by jyri

I just tagged the release of dupd 1.2, enjoy hunting those duplicates!

This time I included pre-built binaries for a few platforms. Probably mostly useful on OS X for those without dev tools installed.

Some dupd performance improvements

Posted on 2015/12/24 by jyri

Performance Improvements in dupd 1.2

Recently I’ve done a few performance improvements to dupd, motivated by one particular edge case file set I was working with a while back. That file set had very large numbers (over 100K) of files of the same size (these were log files from a production system where the content was always different but due to the structure of the files they tended to have the same size). This was a worst case scenario for dupd given the way it grouped files of the same size as potential duplicates. With the latest changes (in dupd 1.2) this scenario is dramatically faster (scan time reduced from about an hour to about five minutes – see below).
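To see why this file set is a worst case, here is a minimal sketch of the general idea of grouping files by size. It is written in Python for brevity and is not dupd's actual code (dupd is written in C); the function names and the sample entries are hypothetical.

```python
from collections import defaultdict

def group_by_size(entries):
    """Group (path, size) pairs into a mapping of size -> list of paths."""
    by_size = defaultdict(list)
    for path, size in entries:
        by_size[size].append(path)
    return by_size

def candidate_sets(by_size):
    """Keep only sizes shared by two or more files; a unique size cannot be a duplicate."""
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}

# Typical case: sizes differ, so most files are ruled out without being read.
typical = [("a.txt", 120), ("b.txt", 340), ("c.txt", 120), ("d.txt", 77)]
typical_candidates = candidate_sets(group_by_size(typical))

# Edge case: every log file has the same size, so the size check rules out nothing.
logs = [(f"log{i}.txt", 4096) for i in range(1000)]
log_candidates = candidate_sets(group_by_size(logs))

print(len(typical_candidates[120]), "candidates in the typical set")
print(len(log_candidates[4096]), "candidates in the same-size set")
```

In the typical set only two of the four files need further inspection. In the same-size set all 1000 remain candidates, so all of the remaining work lands on one very long list.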

In more common scenarios these improvements don’t make a big difference but there is still some small benefit. Memory consumption is also reduced in dupd 1.2 (there is more room to reduce memory consumption that I might play with if I have time some day).

In a nutshell, dupd 1.2 should be either no slower, slightly faster or in some edge cases dramatically faster than dupd 1.1.

The edge case: lots of files of the same size

With dupd 1.1 scan time was 59m57s which is what motivated me to improve it for that file set. Now with dupd 1.2, scan time for the same file set is only 4m57s! Mission accomplished.

The three main changes were:

ptr2end (Reduced time from 59m57s to 26m57s) – Simply store a pointer to the end of the size list instead of walking it. Normally the size lists are tiny; on average I see well under 10 elements. But when a list grew to over 100K elements this made a huge difference.
local_memcmp (Reduced time from 26m57s to 20m36s) – Instead of using memcmp(3) always, use a local implementation when the buffers being compared are small. This made a surprising amount of difference.
hashlist_ptr (Reduced time from 20m36s to 4m57s) – As dupd processes file sets from the sizelist to the hashlists, it was copying the paths. Now it just copies pointers. This skips a lot of unnecessary strcpy(3)ing and also reduces memory consumption.
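The ptr2end idea can be illustrated with a small linked list sketch. This is Python rather than dupd's C, and the class and method names are hypothetical; it only shows why keeping a pointer to the end matters once a list gets long.

```python
class Node:
    """One entry in a singly linked list of paths."""
    def __init__(self, path):
        self.path = path
        self.next = None

class SizeList:
    """Hypothetical list of paths that share one file size."""
    def __init__(self):
        self.head = None
        self.tail = None   # the "pointer to the end"
        self.steps = 0     # counts the links followed while appending

    def append_walking(self, path):
        """Append by walking from the head to the last node every time."""
        node = Node(path)
        if self.head is None:
            self.head = self.tail = node
            return
        cur = self.head
        while cur.next is not None:
            cur = cur.next
            self.steps += 1
        cur.next = node
        self.tail = node

    def append_with_tail(self, path):
        """Append in constant time using the stored tail pointer."""
        node = Node(path)
        if self.head is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node

    def paths(self):
        """Return all paths in list order."""
        out, cur = [], self.head
        while cur is not None:
            out.append(cur.path)
            cur = cur.next
        return out

n = 1000
walking, tailed = SizeList(), SizeList()
for i in range(n):
    walking.append_walking(f"file{i}")
    tailed.append_with_tail(f"file{i}")

print("links followed when walking:", walking.steps)
print("links followed with tail pointer:", tailed.steps)
```

Walking follows (n-1)(n-2)/2 links in total, so the cost grows quadratically with the list length, while the tail pointer version follows none. With a handful of elements the difference is invisible; with over 100K it dominates.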

Normal case: smaller set of files with no odd size distributions

That said, do these changes translate to any benefit on more “normal” file sets? Nowhere near as dramatically, but it’s still faster and uses less memory so that’s all good.

These scans are from my $HOME dir on one machine. Scan time dropped from 10.6s (average of 5 runs) to 8.1s, an improvement of about 23%. Not bad at all.

No change: spinning rust

All the numbers above are from machines with SSDs. I also tested on a couple machines with traditional hard drives and there was zero change in performance. No graph, it’s just a straight line ;-)

With normal hard drives, the file I/O time so completely dominates run time that there is no difference from any dupd improvements.

(I suspect the edge case file set would have seen improvement even on spinning rust, but I didn’t have the chance to test that scenario.)

heliod 0.3

Posted on 2015/10/15 by jyri

A new release of heliod is available, version 0.3.

Pre-built binaries for a few platforms are available: https://github.com/jvirkki/heliod/releases/tag/v0.3

The main driver for this release is that I needed a 64 bit build for debian, as I’ve been meaning to upgrade my server for a long while but was held back by the lack of a 64 bit Linux build in 0.2.

Duplicate detection with dupd

Posted on 2015/10/09 by jyri

I’ve written a few times about dupd, my little CLI tool for finding duplicate files. It tends to perform well, which was one of the goals I had for it.

Randomly browsing the web tonight I came across this article on “What is the fastest way to find duplicate pictures?”. Nice to see the author concluded that:

"Dupd was the clear speed winner"

I’m glad it has worked well for others!

heliod relocated

Posted on 2015/10/08 by jyri

It’s unfortunate to downgrade from Mercurial to git, but overall it should be for the better.

I have moved the heliod source code from its previous home to github here: https://github.com/jvirkki/heliod

I also copied the release-0.2 binaries (built on debian-6, Solaris 10 x86 and Solaris 10 SPARC) to the github release files: https://github.com/jvirkki/heliod/releases

If for some reason you want to download the release-0.1 binaries, they are still available on sourceforge here: http://sourceforge.net/projects/heliod/files/

Comparison of bicycle gear ratios

Posted on 2015/02/20 by jyri

Gear Ratios

Linking these here for my future reference…

Roubaix

50-39-30 chainring, 12-30 Ultegra cassette (10 speed)

Speed @90 rpm

    |    12    13    14    15    17    19    21    24    27    30
----+------------------------------------------------------------
 50 |  29.3  27.0  25.1  23.4  20.6  18.5  16.7  14.6  13.0  11.7
 39 |  22.8  21.1  19.6  18.3  16.1  14.4  13.0  11.4  10.1   9.1
 30 |  17.6  16.2  15.0  14.0  12.4  11.1  10.0   8.8   7.8   7.0
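For reference, the numbers in a table like this follow from cadence, gear ratio and wheel size. The tables do not state the unit or the wheel size, so the sketch below assumes miles per hour and a wheel circumference of about 2.096 m; both are assumptions, picked because they come close to the Roubaix values, and the function name is hypothetical.

```python
METERS_PER_MILE = 1609.344

def speed_mph(chainring, cog, cadence_rpm=90, wheel_circumference_m=2.096):
    """Speed for a given gear, assuming the stated cadence and wheel circumference."""
    # Each crank revolution turns the wheel chainring/cog times.
    wheel_rpm = cadence_rpm * chainring / cog
    meters_per_hour = wheel_rpm * wheel_circumference_m * 60
    return meters_per_hour / METERS_PER_MILE

for chainring in (50, 39, 30):
    row = [speed_mph(chainring, cog) for cog in (12, 13, 14, 15, 17, 19, 21, 24, 27, 30)]
    print(chainring, " ".join(f"{v:5.1f}" for v in row))
```

Under those assumptions the results land within about 0.1 of the table entries; a slightly different wheel circumference would shift every value proportionally.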

Enduro MTB

32 chainring, 12-36 cassette (9 speed)

Speed @90 rpm

     |    12    14    16    18    21    24    28    32    36
 ----+------------------------------------------------------
  32 |  18.6  15.9  13.9  12.4  10.6   9.3   8.0   7.0   6.2

XX1/30

Speed @90 rpm

    |    10    12    14    16    18    21    24    28    32    36    42
----+------------------------------------------------------------------
 30 |  21.8  18.2  15.6  13.6  12.1  10.4   9.1   7.8   6.8   6.1   5.2

Duplicate file detection performance

Posted on 2015/01/06 by jyri

Just over two years ago I tested my dupd against a couple other duplicate detection tools.

Recently I’ve been doing some duplicate cleanup again and while at it I added a few features to dupd and called it version 1.1. So this is as good a time as any to revisit the previous numbers.

I tested a small subset of my file server data using six duplicate detection tools:

dupd 1.1
rdfind 1.3.1 (debian package)
rmlint 2.0.0
fslint 2.42-2 (debian package)
fdupes 1.50-PR2-4 (debian package)
fastdup 0.3

Results

The graph shows the time (in seconds) it took each utility to scan and identify all duplicates in my sample set. I’m happy to see dupd took less than half the time of the next fastest option (rdfind) and was just over seven times faster than fdupes.

Details

The Data

The sample set is 18GB in size and has 392,378 files. There are a total of 117,261 duplicates.

The Machine

I ran this on my small home server, which has an Intel Atom CPU S1260 @ 2.00GHz (4 cores), 8GB RAM, Intel 520 series SSD.

The Runs

For each tool, first I ran it once and ignored the time, just to populate file caches. Then I ran it five times in a row. Discarding the fastest and slowest time, I averaged the remaining three runs to come up with the time shown in the graph above. For most of the tools, the scan times were very consistent from run to run.
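As a sketch of that calculation, the snippet below drops the fastest and slowest of five runs and averages the rest, using the "total" times listed for dupd, rdfind and fdupes in the sections that follow (converted to seconds). The function name is hypothetical.

```python
def trimmed_mean(times):
    """Average the runs after discarding the single fastest and single slowest."""
    if len(times) < 3:
        raise ValueError("need at least three runs")
    kept = sorted(times)[1:-1]
    return sum(kept) / len(kept)

# "total" wall-clock times in seconds, taken from the run listings.
dupd_totals = [29.533, 29.539, 29.572, 29.604, 29.605]
rdfind_totals = [74.75, 74.87, 74.92, 74.95, 75.11]
fdupes_totals = [221.87, 222.07, 222.13, 222.23, 222.86]

dupd_avg = trimmed_mean(dupd_totals)
rdfind_avg = trimmed_mean(rdfind_totals)
fdupes_avg = trimmed_mean(fdupes_totals)

print(f"dupd   {dupd_avg:.2f}s")
print(f"rdfind {rdfind_avg:.2f}s ({rdfind_avg / dupd_avg:.2f}x dupd)")
print(f"fdupes {fdupes_avg:.2f}s ({fdupes_avg / dupd_avg:.2f}x dupd)")
```

With runs this consistent, trimming barely changes the averages, but it guards against a single outlier skewing the result.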

dupd

dupd scan --path $HOME/data -q  13.31s user 15.94s system 99% cpu 29.533 total

dupd scan --path $HOME/data -q  13.17s user 16.09s system 99% cpu 29.539 total
dupd scan --path $HOME/data -q  13.17s user 16.13s system 99% cpu 29.572 total
dupd scan --path $HOME/data -q  13.28s user 16.04s system 99% cpu 29.604 total

dupd scan --path $HOME/data -q  13.59s user 15.74s system 99% cpu 29.605 total

rdfind

rdfind -dryrun true $HOME/data  49.28s user 24.98s system 99% cpu 1:14.75 total

rdfind -dryrun true $HOME/data  49.08s user 25.29s system 99% cpu 1:14.87 total
rdfind -dryrun true $HOME/data  48.93s user 25.52s system 99% cpu 1:14.92 total
rdfind -dryrun true $HOME/data  48.92s user 25.53s system 99% cpu 1:14.95 total

rdfind -dryrun true $HOME/data  49.52s user 25.09s system 99% cpu 1:15.11 total

rmlint

./rmlint -T duplicates $HOME/data  63.53s user 52.55s system 113% cpu 1:42.69 total

./rmlint -T duplicates $HOME/data  64.67s user 52.46s system 113% cpu 1:43.43 total
./rmlint -T duplicates $HOME/data  64.01s user 53.14s system 113% cpu 1:43.63 total
./rmlint -T duplicates $HOME/data  66.47s user 54.32s system 113% cpu 1:46.13 total

./rmlint -T duplicates $HOME/data  67.20s user 56.00s system 113% cpu 1:48.55 total

fslint

./findup $HOME/data  129.46s user 40.77s system 111% cpu 2:32.05 total

./findup $HOME/data  129.75s user 40.53s system 111% cpu 2:32.10 total
./findup $HOME/data  129.58s user 40.82s system 111% cpu 2:32.28 total
./findup $HOME/data  129.89s user 40.80s system 112% cpu 2:32.30 total

./findup $HOME/data  130.47s user 40.34s system 112% cpu 2:32.36 total

fdupes

fdupes -q -r $HOME/data  43.16s user 170.29s system 96% cpu 3:41.87 total

fdupes -q -r $HOME/data  43.39s user 170.24s system 96% cpu 3:42.07 total
fdupes -q -r $HOME/data  42.88s user 170.87s system 96% cpu 3:42.13 total
fdupes -q -r $HOME/data  42.73s user 171.24s system 96% cpu 3:42.23 total

fdupes -q -r $HOME/data  43.64s user 170.83s system 96% cpu 3:42.86 total

fastdup

I was unable to get any times from fastdup as it errors out with “Too many open files”.

No Heartbleed for Heliod

Posted on 2014/04/10 by jyri

The Internet is ablaze with talk about the OpenSSL vulnerability nicknamed Heartbleed (CVE-2014-0160). It is, arguably, one of the worst SSL vulnerabilities in recent memory given how trivial it is to exploit. Attackers can, without leaving any trace and with zero effort, read up to 64K of data from the server (or client) address space. What’s there will vary, but may, if you get (un)lucky, include private keys, passwords or other sensitive info.

Of course, it is not an SSL protocol vulnerability. It is a bug in the OpenSSL implementation. Those of you (us) running the heliod web server have had nothing to do this week since heliod fortunately does not use OpenSSL (it uses NSS). It is a relief, after running around at work to address the Heartbleed vulnerability, that I don’t have to do anything to fix my personal web servers which wisely run heliod!

If you’d also like to run the best performing and most secure web server around, check out heliod.

stdout

Collected Thoughts
