dupd vs. jdupes (take 2)

My previous comparison of dupd vs. jdupes prompted the author of jdupes to try it out on his system and write a similar comparison of dupd and jdupes from his perspective.

The TL;DR is you optimize what you measure. For the most part, jdupes is faster on Jody’s system/dataset and dupd is faster on mine. It’d be fun to do a deeper investigation on the details (and if I ever have some extra spare time I will) but for now I ran a few more rounds of tests to build on the two previous articles.

Methodology Notes

  • First and foremost, while it is fun to try to get optimal run times, I don’t want to focus on scenarios which are so optimized for performance that they are not realistic use cases for me. Thus:
    • SQLite overhead: Yes, dupd saves all the duplicate data into an SQLite database for future reference. This does add overhead, and Jody’s article prompted me to try a few runs with the --nodb option (which causes dupd to skip creating the SQLite db and print to stdout instead, just like jdupes does). However, to me by far the most useful part of dupd is the interactive usage model it enables, which requires the SQLite database. So I won’t focus much on --nodb runs because I’d never run dupd that way and I want to focus on (my) real world usage.
    • File caches: This time I ran tests both with warmed up file caches (by repeating runs) and purged file caches (by explicitly clearing them prior to every run). For me, the warm file cache scenario is actually the one most closely matching real world usage because I tend to run a dupd scan, then work interactively on some subset of data, then run dupd scan again, repeat until tired. For someone whose workflow is to run a cold scan once and not re-scan until much later, the cold cache numbers will be more applicable.
  • I ran both dupd and jdupes with -q to eliminate informative output during the runs. It doesn’t make much difference in dupd but according to Jody this helps jdupes runtimes so I quieted both.
  • By default, dupd ignores hidden files and jdupes includes them. To make comparable runs, either use --hidden for dupd to include them or --nohidden for jdupes to exclude them. I decided to run with dupd --hidden.
  • The average time reported below from each set of runs is the average of all runs but excluding the slowest and fastest runs.
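
As a minimal sketch of that averaging rule (not the script I actually used; the helper names `total_seconds` and `trimmed_average` are made up for illustration), the following parses the wall-clock "total" field from zsh `time` lines and averages the runs after dropping the fastest and the slowest. The sample lines are the SSD warm-cache dupd runs from the raw data at the end of this post.

```python
import re

def total_seconds(line):
    """Extract the wall-clock 'total' field from a zsh `time` line, in seconds."""
    field = re.search(r"(\S+) total\s*$", line).group(1)
    seconds = 0.0
    for part in field.split(":"):   # handles both 9.273 and 1:42.54 style values
        seconds = seconds * 60 + float(part)
    return seconds

def trimmed_average(times):
    """Mean of the runs after dropping one fastest and one slowest run."""
    if len(times) < 3:
        raise ValueError("need at least three runs")
    kept = sorted(times)[1:-1]
    return sum(kept) / len(kept)

ssd_warm_dupd = [
    "dupd scan -p $HOME -q --hidden  4.89s user 8.00s system 138% cpu 9.273 total",
    "dupd scan -p $HOME -q --hidden  5.04s user 8.23s system 142% cpu 9.335 total",
    "dupd scan -p $HOME -q --hidden  4.98s user 7.78s system 139% cpu 9.141 total",
    "dupd scan -p $HOME -q --hidden  4.86s user 7.92s system 139% cpu 9.146 total",
    "dupd scan -p $HOME -q --hidden  5.61s user 8.00s system 143% cpu 9.503 total",
    "dupd scan -p $HOME -q --hidden  4.95s user 7.79s system 140% cpu 9.082 total",
    "dupd scan -p $HOME -q --hidden  4.96s user 7.80s system 139% cpu 9.119 total",
]

avg = trimmed_average([total_seconds(line) for line in ssd_warm_dupd])
print(f"average = {avg:.2f}")   # average = 9.20
```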

Results (SSD)

These runs are on a slow machine (Intel Atom S1260 2.00GHz) with an SSD (Intel SSD 520) running Linux (Debian) with ext4.

Files scanned: 197,171
Total duplicates: 77,044

For my most common usage scenario, jdupes takes almost three times (~2.8x) longer to process the same file set. Running dupd with --nodb is marginally faster than the default dupd run (but I wouldn’t really run it that way because the SQLite db is too convenient).

[Figure: ssd-warm-cache]

Next I tried clearing the file cache before every run. Here the dupd advantage is reduced, but jdupes still took about 1.4x longer than dupd.

[Figure: ssd-cold-cache]

Results (Slow Disk)

Jody’s tests show jdupes being faster in many scenarios, so I’d like to find a way to reproduce that. The answer is slow disks. I have this old Mac Mini (2009) with an even slower disk which should do the trick. Let’s see. Fewer files, fewer duplicates, but the disk is so slow these runs take a while (so I only ran 5 repetitions instead of 7).

Files scanned: 62,347
Total duplicates: 13,982

Indeed, here jdupes has the advantage as dupd takes about 1.7x longer.
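
The ratios quoted in this post follow directly from the averages listed in the raw data at the end. A small sketch of the arithmetic (the dictionary layout and the `slowdown` helper are just for illustration):

```python
# Trimmed averages in seconds, copied from the raw data section.
averages = {
    "SSD, warm cache": {"dupd": 9.20, "jdupes": 25.62},
    "SSD, cold cache": {"dupd": 101.97, "jdupes": 143.81},
    "Slow disk": {"dupd": 1279.60, "jdupes": 767.14},
}

def slowdown(times):
    """Return (faster tool, slower tool, slower time / faster time)."""
    faster, slower = sorted(times, key=times.get)
    return faster, slower, times[slower] / times[faster]

for scenario, times in averages.items():
    faster, slower, ratio = slowdown(times)
    print(f"{scenario}: {slower} takes {ratio:.1f}x as long as {faster}")
# SSD, warm cache: jdupes takes 2.8x as long as dupd
# SSD, cold cache: jdupes takes 1.4x as long as dupd
# Slow disk: dupd takes 1.7x as long as jdupes
```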

[Figure: rust-warm-cache]

A few notes on these Mac runs:

  • I did also run dupd with --nodb, but here it didn’t make any meaningful difference.
  • I also ran both dupd and jdupes with a cold file cache, or at least maybe I did. I ran the purge(8) command prior to each run. Its documentation claims: “Purge can be used to approximate initial boot conditions with a cold disk buffer cache for performance analysis”. However, it made no difference at all in the measured times for either dupd or jdupes.

Conclusions

It seems that in terms of performance, you’ll do better with dupd if you’re on SSDs but if you’re on HDDs then jdupes can be faster. Ideally, try both and let us know!

Also, even though tuning and testing the performance is so much fun, ultimately usability matters even more. For me, the interactive workflow supported by dupd is what makes it special (but then, that’s why I wrote it so I’m biased ;-) and I couldn’t live without it.

Finally, thanks to Jody for fixing a bug in dupd that showed up only on XFS (which I don’t use so never noticed) and for prompting me to do a few additional enhancements.

 

Raw Data (commands and times)

SSD: dupd, normal usage

% repeat 7 time dupd scan -p $HOME -q --hidden
dupd scan -p $HOME -q --hidden  4.89s user 8.00s system 138% cpu 9.273 total
dupd scan -p $HOME -q --hidden  5.04s user 8.23s system 142% cpu 9.335 total
dupd scan -p $HOME -q --hidden  4.98s user 7.78s system 139% cpu 9.141 total
dupd scan -p $HOME -q --hidden  4.86s user 7.92s system 139% cpu 9.146 total
dupd scan -p $HOME -q --hidden  5.61s user 8.00s system 143% cpu 9.503 total
dupd scan -p $HOME -q --hidden  4.95s user 7.79s system 140% cpu 9.082 total
dupd scan -p $HOME -q --hidden  4.96s user 7.80s system 139% cpu 9.119 total

average = 9.20

SSD: dupd, clear file cache

% repeat 7 (sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"; time dupd scan -p $HOME -q --hidden)
dupd scan -p $HOME -q --hidden  11.86s user 43.55s system 54% cpu 1:42.54 total
dupd scan -p $HOME -q --hidden  12.42s user 44.47s system 55% cpu 1:43.41 total
dupd scan -p $HOME -q --hidden  12.12s user 43.65s system 56% cpu 1:39.03 total
dupd scan -p $HOME -q --hidden  12.22s user 43.37s system 55% cpu 1:40.69 total
dupd scan -p $HOME -q --hidden  12.60s user 45.28s system 53% cpu 1:47.55 total
dupd scan -p $HOME -q --hidden  12.10s user 44.51s system 54% cpu 1:44.18 total
dupd scan -p $HOME -q --hidden  12.43s user 43.74s system 57% cpu 1:36.92 total

average = 101.97

SSD: dupd, do not create database

% repeat 7 time dupd scan -p $HOME -q --hidden  --nodb > results
dupd scan -p $HOME -q --hidden --nodb > results  4.26s user 7.70s system 136% cpu 8.785 total
dupd scan -p $HOME -q --hidden --nodb > results  4.36s user 7.54s system 136% cpu 8.710 total
dupd scan -p $HOME -q --hidden --nodb > results  4.28s user 7.69s system 136% cpu 8.770 total
dupd scan -p $HOME -q --hidden --nodb > results  4.23s user 7.64s system 136% cpu 8.708 total
dupd scan -p $HOME -q --hidden --nodb > results  4.34s user 7.58s system 136% cpu 8.757 total
dupd scan -p $HOME -q --hidden --nodb > results  4.19s user 7.66s system 135% cpu 8.736 total
dupd scan -p $HOME -q --hidden --nodb > results  4.58s user 7.75s system 140% cpu 8.772 total

average = 8.75

SSD: dupd, clear file cache and do not create database

% repeat 7 (sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"; time dupd scan -p $HOME -q --hidden --nodb > results)
dupd scan -p $HOME -q --hidden --nodb > results  9.67s user 36.51s system 51% cpu 1:29.39 total
dupd scan -p $HOME -q --hidden --nodb > results  10.92s user 43.76s system 53% cpu 1:41.45 total
dupd scan -p $HOME -q --hidden --nodb > results  10.79s user 43.58s system 54% cpu 1:38.93 total
dupd scan -p $HOME -q --hidden --nodb > results  10.62s user 43.59s system 56% cpu 1:35.45 total
dupd scan -p $HOME -q --hidden --nodb > results  10.76s user 44.39s system 54% cpu 1:41.45 total
dupd scan -p $HOME -q --hidden --nodb > results  10.92s user 43.78s system 55% cpu 1:38.87 total
dupd scan -p $HOME -q --hidden --nodb > results  10.72s user 43.07s system 53% cpu 1:41.50 total

average = 99.23

SSD: jdupes, warm file cache

% repeat 7 time jdupes -q -r $HOME > results
jdupes -q -r $HOME > results  10.54s user 14.91s system 99% cpu 25.626 total
jdupes -q -r $HOME > results  10.76s user 14.66s system 99% cpu 25.587 total
jdupes -q -r $HOME > results  10.68s user 14.86s system 99% cpu 25.725 total
jdupes -q -r $HOME > results  10.76s user 14.67s system 99% cpu 25.614 total
jdupes -q -r $HOME > results  10.62s user 14.76s system 99% cpu 25.549 total
jdupes -q -r $HOME > results  10.75s user 14.87s system 99% cpu 25.801 total
jdupes -q -r $HOME > results  10.48s user 14.87s system 99% cpu 25.527 total

average = 25.62

SSD: jdupes, clear file cache

% repeat 7 (sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"; time jdupes -q -r $HOME > results)
jdupes -q -r $HOME > results  26.11s user 72.99s system 67% cpu 2:25.89 total
jdupes -q -r $HOME > results  26.54s user 71.06s system 68% cpu 2:22.01 total
jdupes -q -r $HOME > results  24.64s user 72.62s system 66% cpu 2:26.57 total
jdupes -q -r $HOME > results  26.01s user 70.05s system 68% cpu 2:20.15 total
jdupes -q -r $HOME > results  26.25s user 72.48s system 67% cpu 2:26.67 total
jdupes -q -r $HOME > results  24.63s user 70.70s system 67% cpu 2:20.77 total
jdupes -q -r $HOME > results  25.41s user 72.40s system 68% cpu 2:23.80 total

average = 143.81

Slow Disk: dupd, normal usage

% repeat 5 time dupd scan -p $HOME --hidden -q
dupd scan -p $HOME --hidden -q  4.62s user 29.72s system 2% cpu 21:05.70 total
dupd scan -p $HOME --hidden -q  4.38s user 29.88s system 2% cpu 22:34.14 total
dupd scan -p $HOME --hidden -q  4.78s user 30.09s system 2% cpu 21:29.52 total
dupd scan -p $HOME --hidden -q  4.37s user 29.07s system 2% cpu 21:18.10 total
dupd scan -p $HOME --hidden -q  4.39s user 29.24s system 2% cpu 21:11.19 total

average = 1279.60

Slow Disk: jdupes, warm file cache

% repeat 5 time jdupes -q -r $HOME > results
jdupes -q -r $HOME > results  8.88s user 31.70s system 5% cpu 13:05.37 total
jdupes -q -r $HOME > results  8.87s user 30.51s system 5% cpu 12:41.61 total
jdupes -q -r $HOME > results  8.80s user 30.56s system 4% cpu 13:30.56 total
jdupes -q -r $HOME > results  8.85s user 30.62s system 5% cpu 12:34.43 total
jdupes -q -r $HOME > results  8.80s user 30.18s system 5% cpu 12:32.14 total

average = 767.14

dupd vs. jdupes

Tonight I ran across jdupes which I had not seen before. It is a fork of the venerable fdupes with quite a few performance improvements. Performance?! Well I had to try it of course. Here are a few runs of jdupes and dupd on my home directory for comparison (using -A to skip hidden files which is the default in dupd):

% repeat 5 time ./jdupes -r $HOME -A  > out
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.06s user 10.96s system 99% cpu 14.156 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.07s user 10.82s system 99% cpu 14.029 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.00s user 11.01s system 99% cpu 14.156 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.22s user 11.03s system 98% cpu 14.414 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.04s user 10.87s system 99% cpu 14.042 total

So, consistently about 14 seconds.

% repeat 5 time dupd scan -p $HOME -q
dupd scan -p $HOME -q  3.99s user 6.76s system 139% cpu 7.691 total
dupd scan -p $HOME -q  4.16s user 6.28s system 140% cpu 7.416 total
dupd scan -p $HOME -q  4.13s user 6.53s system 141% cpu 7.540 total
dupd scan -p $HOME -q  3.98s user 6.39s system 139% cpu 7.405 total
dupd scan -p $HOME -q  4.00s user 6.44s system 140% cpu 7.404 total

About 7.5 seconds, or just under that, for dupd.

I still have a handful of ideas to make dupd faster; as I find some spare time I’ll try them out.

dupd 1.2 released

I just tagged the release of dupd 1.2, enjoy hunting those duplicates!

This time I included pre-built binaries for a few platforms. Probably mostly useful on OS X for those without dev tools installed.

Some dupd performance improvements

Performance Improvements in dupd 1.2

Recently I’ve done a few performance improvements to dupd, motivated by one particular edge case file set I was working with a while back. That file set had very large numbers (over 100K) of files of the same size (these were log files from a production system where the content was always different but due to the structure of the files they tended to have the same size). This was a worst case scenario for dupd given the way it grouped files of the same size as potential duplicates. With the latest changes (in dupd 1.2) this scenario is dramatically faster (scan time reduced from about an hour to about five minutes – see below).
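
To illustrate why that file set is a worst case, here is a rough sketch of size-based grouping. It is a Python illustration of the general idea only, not dupd's actual code (dupd is written in C), and the `size_groups` helper and the sample file names are hypothetical.

```python
from collections import defaultdict

def size_groups(files):
    """Map size -> paths, keeping only sizes shared by at least two files.

    A file whose size is unique cannot have a duplicate, so only the
    files in these groups need to have their contents examined.
    """
    groups = defaultdict(list)
    for path, size in files:
        groups[size].append(path)
    return {size: paths for size, paths in groups.items() if len(paths) > 1}

# Typical set: sizes are mostly distinct, so few files remain as candidates.
typical = [("a.txt", 120), ("b.jpg", 48213), ("c.jpg", 48213), ("d.pdf", 9001)]
print(size_groups(typical))            # {48213: ['b.jpg', 'c.jpg']}

# Edge case: every log file has the same size, so nothing is filtered out
# and all 100,000 files land in one huge candidate group.
logs = [(f"log/{i}.log", 4096) for i in range(100_000)]
print(len(size_groups(logs)[4096]))    # 100000
```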

In more common scenarios these improvements don’t make a big difference but there is still some small benefit. Memory consumption is also reduced in dupd 1.2 (there is more room to reduce memory consumption that I might play with if I have time some day).

In a nutshell, dupd 1.2 should be either no slower, slightly faster or in some edge cases dramatically faster than dupd 1.1.

The edge case: lots of files of the same size

[Figure: dupd_samesizes]

With dupd 1.1 scan time was 59m57s, which is what motivated me to improve it for that file set. Now with dupd 1.2, scan time for the same file set is only 4m57s! Mission accomplished.

The three main changes were:

  • ptr2end (Reduced time from 59m57s to 26m57s) – Simply store a pointer to the end of the size list instead of walking it. Normally the size lists are tiny, on average I see well under 10 elements. But when it grew to over 100K elements this made a huge difference.
  • local_memcmp (Reduced time from 26m57s to 20m36s) – Instead of using memcmp(3) always, use a local implementation when the buffers being compared are small. This made a surprising amount of difference.
  • hashlist_ptr (Reduced time from 20m36s to 4m57s) – As dupd processes file sets from the sizelist to the hashlists, it was copying the paths. Now, just copy pointers. This skips a lot of unnecessary strcpy(3)ing as well as reduces memory consumption.
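
The first of these changes is easy to show in miniature. The sketch below is a Python illustration of the idea, not the C code in dupd; the `Node` and `SizeList` classes are hypothetical. Appending by walking the list visits every existing node each time, so building a list of n elements costs on the order of n²/2 node visits (about 5 billion for 100K elements), while keeping a pointer to the end makes each append constant time.

```python
class Node:
    def __init__(self, path):
        self.path = path
        self.next = None

class SizeList:
    """Singly linked list of paths sharing one file size (hypothetical structure)."""

    def __init__(self):
        self.head = None
        self.tail = None   # pointer to the last node

    def append_walking(self, path):
        """Walk from the head to find the end: O(n) per append."""
        node = Node(path)
        if self.head is None:
            self.head = node
        else:
            cur = self.head
            while cur.next is not None:
                cur = cur.next
            cur.next = node
        self.tail = node

    def append_tail(self, path):
        """Use the stored end pointer instead of walking: O(1) per append."""
        node = Node(path)
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node

    def paths(self):
        """Return the stored paths in list order."""
        out, cur = [], self.head
        while cur is not None:
            out.append(cur.path)
            cur = cur.next
        return out

walked, pointed = SizeList(), SizeList()
for p in ["a.log", "b.log", "c.log"]:
    walked.append_walking(p)
    pointed.append_tail(p)
print(walked.paths() == pointed.paths())   # True: same list, very different cost
```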

Normal case: smaller set of files with no odd size distributions

That said, do these changes translate to any benefit on more “normal” file sets? Nowhere near as dramatically, but it’s still faster and uses less memory so that’s all good.

[Figure: dupd_home]

These scans are from my $HOME dir on one machine, scan time reduced from 10.6s (average of 5 runs) to 8.1s, an improvement of about 23%, not bad at all.

No change: spinning rust

All the numbers above are from machines with SSDs. I also tested on a couple machines with traditional hard drives and there was zero change in performance. No graph, it’s just a straight line ;-)

With normal hard drives, the file I/O time so completely dominates run time that there is no difference from any dupd improvements.

(I suspect the edge case file set would have seen improvement even on spinning rust, but I didn’t have the chance to test that scenario.)

 

heliod relocated

It’s unfortunate to downgrade from mercurial to git but overall it should be for the better.

I have moved the heliod source code from its previous home to github here: https://github.com/jvirkki/heliod

I also copied the release-0.2 binaries (built on debian-6, Solaris 10 x86 and Solaris 10 SPARC) to the github release files: https://github.com/jvirkki/heliod/releases

If for some reason you want to download the release-0.1 binaries, they are still available on sourceforge here: http://sourceforge.net/projects/heliod/files/

 

Comparison of bicycle gear ratios

Gear Ratios

Linking these here for my future reference…

Roubaix

50-39-30 chainring, 12-30 Ultegra cassette (10 speed)

[Figure: roubaix]

Speed @90 rpm

    |    12    13    14    15    17    19    21    24    27    30
----+------------------------------------------------------------
 50 |  29.3  27.0  25.1  23.4  20.6  18.5  16.7  14.6  13.0  11.7
 39 |  22.8  21.1  19.6  18.3  16.1  14.4  13.0  11.4  10.1   9.1
 30 |  17.6  16.2  15.0  14.0  12.4  11.1  10.0   8.8   7.8   7.0
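
For reference, the general formula behind a table like this is speed = cadence × (chainring / cog) × wheel circumference. The sketch below makes two assumptions that are not stated anywhere in this post: that the table is in mph, and that the wheel circumference is about 2.095 m (roughly a 700c road tyre). The function name `speed_mph` is also just for illustration. With those assumptions the 50x12 gear comes out at the 29.3 shown above; some other cells differ by 0.1, presumably because the original calculator rounded differently or used a slightly different wheel size.

```python
def speed_mph(chainring, cog, cadence_rpm=90, wheel_circumference_m=2.095):
    """Road speed for a given gear and cadence.

    Both the unit (mph) and the default wheel circumference are assumptions.
    """
    wheel_rpm = cadence_rpm * chainring / cog           # wheel turns per minute
    metres_per_hour = wheel_rpm * wheel_circumference_m * 60
    return metres_per_hour / 1609.344                   # metres per mile

print(round(speed_mph(50, 12), 1))   # 29.3
print(round(speed_mph(30, 30), 1))   # 7.0
```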

Enduro MTB

32 chainring, 12-36 cassette (9 speed)

[Figure: mtb]

Speed @90 rpm

     |    12    14    16    18    21    24    28    32    36
 ----+------------------------------------------------------
  32 |  18.6  15.9  13.9  12.4  10.6   9.3   8.0   7.0   6.2

XX1/30

[Figure: xx1]

Speed @90 rpm

    |    10    12    14    16    18    21    24    28    32    36    42
----+------------------------------------------------------------------
 30 |  21.8  18.2  15.6  13.6  12.1  10.4   9.1   7.8   6.8   6.1   5.2

Duplicate file detection performance

Just over two years ago I tested my dupd against a couple other duplicate detection tools.

Recently I’ve been doing some duplicate cleanup again and while at it I added a few features to dupd and called it version 1.1. So this is as good a time as any to revisit the previous numbers.

I tested a small subset of my file server data using six duplicate detection tools: dupd, rdfind, rmlint, fslint, fdupes and fastdup.

Results

The graph shows the time (in seconds) it took each utility to scan and identify all duplicates in my sample set. I’m happy to see dupd took less than half the time of the next fastest option (rdfind) and was just over seven times faster than fdupes.

[Figure: duplicates]

Details

The Data

The sample set is 18GB in size and has 392,378 files. There are a total of 117,261 duplicates.

The Machine

I ran this on my small home server, which has an Intel Atom CPU S1260 @ 2.00GHz (4 cores), 8GB RAM, Intel 520 series SSD.

The Runs

For each tool, first I ran it once and ignored the time, just to populate file caches. Then I ran it five times in a row. Discarding the fastest and slowest time, I averaged the remaining three runs to come up with the time shown in the graph above. For most of the tools, the scan times were very consistent from run to run.

dupd

dupd scan --path $HOME/data -q  13.31s user 15.94s system 99% cpu 29.533 total

dupd scan --path $HOME/data -q  13.17s user 16.09s system 99% cpu 29.539 total
dupd scan --path $HOME/data -q  13.17s user 16.13s system 99% cpu 29.572 total
dupd scan --path $HOME/data -q  13.28s user 16.04s system 99% cpu 29.604 total

dupd scan --path $HOME/data -q  13.59s user 15.74s system 99% cpu 29.605 total

rdfind

rdfind -dryrun true $HOME/data  49.28s user 24.98s system 99% cpu 1:14.75 total

rdfind -dryrun true $HOME/data  49.08s user 25.29s system 99% cpu 1:14.87 total
rdfind -dryrun true $HOME/data  48.93s user 25.52s system 99% cpu 1:14.92 total
rdfind -dryrun true $HOME/data  48.92s user 25.53s system 99% cpu 1:14.95 total

rdfind -dryrun true $HOME/data  49.52s user 25.09s system 99% cpu 1:15.11 total

rmlint

./rmlint -T duplicates $HOME/data  63.53s user 52.55s system 113% cpu 1:42.69 total

./rmlint -T duplicates $HOME/data  64.67s user 52.46s system 113% cpu 1:43.43 total
./rmlint -T duplicates $HOME/data  64.01s user 53.14s system 113% cpu 1:43.63 total
./rmlint -T duplicates $HOME/data  66.47s user 54.32s system 113% cpu 1:46.13 total

./rmlint -T duplicates $HOME/data  67.20s user 56.00s system 113% cpu 1:48.55 total

fslint

./findup $HOME/data  129.46s user 40.77s system 111% cpu 2:32.05 total

./findup $HOME/data  129.75s user 40.53s system 111% cpu 2:32.10 total
./findup $HOME/data  129.58s user 40.82s system 111% cpu 2:32.28 total
./findup $HOME/data  129.89s user 40.80s system 112% cpu 2:32.30 total

./findup $HOME/data  130.47s user 40.34s system 112% cpu 2:32.36 total

fdupes

fdupes -q -r $HOME/data  43.16s user 170.29s system 96% cpu 3:41.87 total

fdupes -q -r $HOME/data  43.39s user 170.24s system 96% cpu 3:42.07 total
fdupes -q -r $HOME/data  42.88s user 170.87s system 96% cpu 3:42.13 total
fdupes -q -r $HOME/data  42.73s user 171.24s system 96% cpu 3:42.23 total

fdupes -q -r $HOME/data  43.64s user 170.83s system 96% cpu 3:42.86 total

fastdup

I was unable to get any times from fastdup as it errors out with “Too many open files”.