Duplicate finder performance (2018 edition)

It has been over three years since I did a performance comparison of a group of duplicate file finder utilities. Having recently released dupd 1.7, I finally took some time to refresh my benchmarks.

The Contenders

This time around I tested the following utilities: dupd, rmlint, jdupes, rdfind, duff, fdupes and fslint.

I had also intended to test the following, but they crashed during runs so I didn’t get any numbers: dups, fastdup, identix, py_duplicates.

Based on these and the previous results, I think going forward I’ll only test these: dupd, jdupes, rmlint and rdfind.

If anyone knows of another duplicate finder with solid performance worth comparing, please let me know!

The Files

The file set has a total of 154205 files. Of these, 18677 have unique sizes (so they cannot be duplicates of anything) and 1216 were otherwise ignored (zero-sized or not regular files). This leaves 134312 files for further processing. Of these, there are 44926 duplicates in 13828 groups (and thus, 89386 unique files).
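The size-based filtering behind these numbers (a file whose size is unique cannot have a duplicate) can be sketched roughly as follows. This is only an illustration, not the implementation of any of the tested tools; the function name `candidate_groups` and the path-to-size mapping it takes are assumptions made for the example.

```python
from collections import defaultdict


def candidate_groups(sizes):
    """Group paths by size and keep only sizes shared by two or more files.

    `sizes` maps a path to its size in bytes (hypothetical input; a real
    tool would collect this while walking the directory tree).
    Zero-sized files are skipped, mirroring the file set description.
    """
    by_size = defaultdict(list)
    for path, size in sizes.items():
        if size > 0:
            by_size[size].append(path)
    # A size seen only once cannot have a duplicate, so drop it.
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}


if __name__ == "__main__":
    example = {"a.txt": 10, "b.txt": 10, "c.txt": 7, "empty": 0}
    print(candidate_groups(example))
```

Only the files that survive this step need their content read and compared, which is where the I/O cost measured below comes from.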

The files are all “real” files. That is, they are all taken from my home file server instead of artificially constructed for the benchmark. There is a mix of all types of files such as source code, documents, images, videos and other misc stuff that accumulates on the file server.

In the past I’ve generally focused on testing on SSD media only, as that’s what I generally use myself. To be more thorough, this time I installed a HDD on the same machine and duplicated the exact same set of files on both devices.

The cache

Of course, when a process reads file content it doesn’t necessarily trigger a read from the underlying device, be it SSD or HDD, because the content may already be in the file cache (and often is).

This time I’ve run each utility/media combination twice. Once where the file cache is cleared prior to every run and another where the cache is left undisturbed from run to run.

In my experience, the warm cache runs are more representative of real world usage because when I’m working on duplicates I run the tool many times as I clean up files. For the sake of more thorough results, I’ve reported both scenarios.

The methodology

For each tool/media (SSD and HDD) combination, the runs were done as follows:

  1. Clear the filesystem cache (echo 3 > /proc/sys/vm/drop_caches).
  2. Run the scan once, discarding the result.
  3. Repeat 5 times:
    1. For the no-cache runs, clear the cache again.
    2. Run and time the tool.
  4. Report the average of the above five runs as the result.
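The steps above can be sketched as a small harness. This is a minimal sketch under stated assumptions, not the script that produced the numbers in this article: the names `drop_caches`, `timed_run` and `benchmark` are hypothetical, and writing to /proc/sys/vm/drop_caches requires root on Linux.

```python
import statistics
import subprocess
import time

DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_caches():
    """Clear the filesystem cache, like `echo 3 > /proc/sys/vm/drop_caches`."""
    with open(DROP_CACHES, "w") as f:
        f.write("3\n")


def timed_run(cmd):
    """Run cmd once, discard its output, and return the elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start


def benchmark(cmd, runs=5, clear_each_run=False, clear=drop_caches, run=timed_run):
    """Return (average, individual times) for `runs` timed runs of cmd."""
    clear()      # step 1: start from a cold cache
    run(cmd)     # step 2: one scan whose result is discarded
    times = []
    for _ in range(runs):           # step 3
        if clear_each_run:          # only for the no-cache runs
            clear()
        times.append(run(cmd))
    return statistics.mean(times), times  # step 4
```

For example, `benchmark(["dupd", "scan", "-q", "-p", "/hdd/files"], clear_each_run=True)` would correspond to the HDD without cache scenario. The `clear` and `run` parameters exist only so the harness can be exercised without root or the real tools.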

The command lines and individual run times are included at the bottom of this article.

Results

1. HDD with cache

[Figure: HDDcache]

2. HDD without cache

[Figure: HDDNOcache]

3. SSD with cache

[Figure: SSDcache]

4. SSD without cache

[Figure: SSDNOcache]

Summary

As you can see above, the ranking varies depending on each scenario. However, I’m happy to see dupd is the fastest in three of four scenarios and a very close second in the fourth.

To conclude with some kind of ranking, let’s look at the average finishing position of each tool:

Tool     Average rank   Rank per scenario (HDD with cache, HDD without cache, SSD with cache, SSD without cache)
dupd     1.3            1, 1, 1, 2
rmlint   3.3            7, 3, 2, 1
rdfind   3.5            4, 2, 5, 3
jdupes   3.8            2, 5, 3, 5
duff     4.5            3, 7, 4, 4
fdupes   5.8            5, 6, 6, 6
fslint   6.0            6, 4, 7, 7
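The average finishing position can be derived from the per-scenario average times along these lines. This is a sketch only: the function names are hypothetical, and the times are the AVERAGE TIME values copied from the raw data section of this article.

```python
# Average times in seconds, copied from the raw data in this article.
AVERAGES = {
    "HDD with cache": {
        "rmlint": 27.716, "jdupes": 5.956, "dupd": 3.24, "rdfind": 8.486,
        "fslint": 20.378, "fdupes": 15.776, "duff": 7.0,
    },
    "HDD without cache": {
        "rmlint": 90.082, "jdupes": 269.102, "dupd": 46.468, "rdfind": 86.674,
        "fslint": 103.308, "fdupes": 276.95, "duff": 936.494,
    },
    "SSD with cache": {
        "rmlint": 6.33, "jdupes": 6.87, "dupd": 3.27, "rdfind": 8.414,
        "fslint": 20.564, "fdupes": 15.566, "duff": 6.968,
    },
    "SSD without cache": {
        "rmlint": 17.26, "jdupes": 34.366, "dupd": 19.628, "rdfind": 30.724,
        "fslint": 44.324, "fdupes": 41.818, "duff": 32.492,
    },
}


def finishing_positions(averages):
    """Map each tool to its finishing position (1 = fastest) per scenario."""
    positions = {}
    for times in averages.values():
        ordered = sorted(times, key=times.get)  # fastest first
        for place, tool in enumerate(ordered, start=1):
            positions.setdefault(tool, []).append(place)
    return positions


def average_position(positions):
    """Average the finishing positions of each tool."""
    return {tool: sum(p) / len(p) for tool, p in positions.items()}


if __name__ == "__main__":
    positions = finishing_positions(AVERAGES)
    averages = average_position(positions)
    for tool in sorted(averages, key=averages.get):
        print(tool, averages[tool], positions[tool])
```

Averaging positions rather than times keeps one very slow scenario (such as the HDD runs without cache) from dominating the overall ranking.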

The Raw Data

-----[ rmlint : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 92.07
Running 5 times (timeout=3600): rmlint -o fdupes /hdd/files
Run 0 took 27.83
Run 1 took 27.75
Run 2 took 27.59
Run 3 took 27.75
Run 4 took 27.66
AVERAGE TIME:
27.716


-----[ jdupes : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 268.6
Running 5 times (timeout=3600): jdupes -A -H -r -q /hdd/files
Run 0 took 5.96
Run 1 took 5.94
Run 2 took 5.96
Run 3 took 5.93
Run 4 took 5.99
AVERAGE TIME:
5.956


-----[ dupd-hdd : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 46.78
Running 5 times (timeout=3600): dupd scan -q -p /hdd/files
Run 0 took 3.24
Run 1 took 3.29
Run 2 took 3.21
Run 3 took 3.21
Run 4 took 3.25
AVERAGE TIME:
3.24


-----[ rdfind : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 90.25
Running 5 times (timeout=3600): rdfind -n true /hdd/files
Run 0 took 8.52
Run 1 took 8.49
Run 2 took 8.53
Run 3 took 8.48
Run 4 took 8.41
AVERAGE TIME:
8.486


-----[ fslint : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 103.64
Running 5 times (timeout=3600): findup /hdd/files
Run 0 took 20.38
Run 1 took 20.4
Run 2 took 20.36
Run 3 took 20.36
Run 4 took 20.39
AVERAGE TIME:
20.378


-----[ fdupes : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 278.11
Running 5 times (timeout=3600): fdupes -A -H -r -q /hdd/files
Run 0 took 15.76
Run 1 took 15.78
Run 2 took 15.72
Run 3 took 15.74
Run 4 took 15.88
AVERAGE TIME:
15.776


-----[ duff : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 935.51
Running 5 times (timeout=3600): duff -r -z /hdd/files
Run 0 took 7.03
Run 1 took 7.01
Run 2 took 6.98
Run 3 took 6.99
Run 4 took 6.99
AVERAGE TIME:
7


-----[ rmlint : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 90.59
Running 5 times (timeout=3600): rmlint -o fdupes /hdd/files
Run 0 took 89.86
Run 1 took 89.4
Run 2 took 90.44
Run 3 took 89.87
Run 4 took 90.84
AVERAGE TIME:
90.082


-----[ jdupes : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 269.69
Running 5 times (timeout=3600): jdupes -A -H -r -q /hdd/files
Run 0 took 268.97
Run 1 took 270.07
Run 2 took 268.52
Run 3 took 268.95
Run 4 took 269
AVERAGE TIME:
269.102


-----[ dupd-hdd : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 46.26
Running 5 times (timeout=3600): dupd scan -q -p /hdd/files
Run 0 took 46.37
Run 1 took 46.43
Run 2 took 46.24
Run 3 took 46.68
Run 4 took 46.62
AVERAGE TIME:
46.468


-----[ rdfind : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 86.59
Running 5 times (timeout=3600): rdfind -n true /hdd/files
Run 0 took 86.48
Run 1 took 87.02
Run 2 took 86.55
Run 3 took 86.57
Run 4 took 86.75
AVERAGE TIME:
86.674


-----[ fslint : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 103.29
Running 5 times (timeout=3600): findup /hdd/files
Run 0 took 103.49
Run 1 took 103.64
Run 2 took 102.97
Run 3 took 103.16
Run 4 took 103.28
AVERAGE TIME:
103.308


-----[ fdupes : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 276.02
Running 5 times (timeout=3600): fdupes -A -H -r -q /hdd/files
Run 0 took 276.88
Run 1 took 276.18
Run 2 took 276.83
Run 3 took 277.87
Run 4 took 276.99
AVERAGE TIME:
276.95


-----[ duff : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 935.56
Running 5 times (timeout=3600): duff -r -z /hdd/files
Run 0 took 936.06
Run 1 took 936.87
Run 2 took 936.58
Run 3 took 937.01
Run 4 took 935.95
AVERAGE TIME:
936.494


-----[ rmlint : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 18.62
Running 5 times (timeout=3600): rmlint -o fdupes /ssd/files
Run 0 took 6.38
Run 1 took 6.33
Run 2 took 6.3
Run 3 took 6.32
Run 4 took 6.32
AVERAGE TIME:
6.33


-----[ jdupes : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 35.12
Running 5 times (timeout=3600): jdupes -A -H -r -q /ssd/files
Run 0 took 6.89
Run 1 took 6.84
Run 2 took 6.88
Run 3 took 6.83
Run 4 took 6.91
AVERAGE TIME:
6.87


-----[ dupd-hdd : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 19.85
Running 5 times (timeout=3600): dupd scan -q -p /ssd/files
Run 0 took 3.34
Run 1 took 3.17
Run 2 took 3.25
Run 3 took 3.3
Run 4 took 3.29
AVERAGE TIME:
3.27


-----[ rdfind : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 31.43
Running 5 times (timeout=3600): rdfind -n true /ssd/files
Run 0 took 8.5
Run 1 took 8.38
Run 2 took 8.42
Run 3 took 8.39
Run 4 took 8.38
AVERAGE TIME:
8.414


-----[ fslint : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 44.67
Running 5 times (timeout=3600): findup /ssd/files
Run 0 took 20.63
Run 1 took 20.58
Run 2 took 20.54
Run 3 took 20.54
Run 4 took 20.53
AVERAGE TIME:
20.564


-----[ fdupes : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 42.13
Running 5 times (timeout=3600): fdupes -A -H -r -q /ssd/files
Run 0 took 15.68
Run 1 took 15.52
Run 2 took 15.53
Run 3 took 15.56
Run 4 took 15.54
AVERAGE TIME:
15.566


-----[ duff : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 32.54
Running 5 times (timeout=3600): duff -r -z /ssd/files
Run 0 took 7
Run 1 took 6.96
Run 2 took 6.98
Run 3 took 6.95
Run 4 took 6.95
AVERAGE TIME:
6.968


-----[ rmlint : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 17.39
Running 5 times (timeout=3600): rmlint -o fdupes /ssd/files
Run 0 took 17.29
Run 1 took 17.21
Run 2 took 17.25
Run 3 took 17.24
Run 4 took 17.31
AVERAGE TIME:
17.26


-----[ jdupes : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 34.36
Running 5 times (timeout=3600): jdupes -A -H -r -q /ssd/files
Run 0 took 34.3
Run 1 took 34.35
Run 2 took 34.48
Run 3 took 34.34
Run 4 took 34.36
AVERAGE TIME:
34.366


-----[ dupd-hdd : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 19.7
Running 5 times (timeout=3600): dupd scan -q -p /ssd/files
Run 0 took 19.67
Run 1 took 19.65
Run 2 took 19.66
Run 3 took 19.65
Run 4 took 19.51
AVERAGE TIME:
19.628


-----[ rdfind : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 30.93
Running 5 times (timeout=3600): rdfind -n true /ssd/files
Run 0 took 30.7
Run 1 took 30.61
Run 2 took 30.72
Run 3 took 30.8
Run 4 took 30.79
AVERAGE TIME:
30.724


-----[ fslint : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 44.41
Running 5 times (timeout=3600): findup /ssd/files
Run 0 took 44.23
Run 1 took 44.3
Run 2 took 44.44
Run 3 took 44.24
Run 4 took 44.41
AVERAGE TIME:
44.324


-----[ fdupes : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 42.05
Running 5 times (timeout=3600): fdupes -A -H -r -q /ssd/files
Run 0 took 41.79
Run 1 took 41.79
Run 2 took 41.79
Run 3 took 41.8
Run 4 took 41.92
AVERAGE TIME:
41.818


-----[ duff : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 32.45
Running 5 times (timeout=3600): duff -r -z /ssd/files
Run 0 took 32.49
Run 1 took 32.48
Run 2 took 32.52
Run 3 took 32.49
Run 4 took 32.48
AVERAGE TIME:
32.492

 

dupd 1.7 released

I just tagged release 1.7 of dupd.

See the ChangeLog for a list of changes.

The major change is that the SSD mode has been removed. That’s a bit sad because as I’ve written over the years, the SSD mode is/was faster in some scenarios when reading off a SSD. However, as I’ve also mentioned, the drawback was that the SSD mode could be vastly slower in the “wrong” circumstances.

As of version 1.6 I intended to keep both so the best one can be used in each situation. During the 1.7 development I ended up doing internal code refactoring on the HDD mode to make it simpler and faster. The end result was that the two implementations diverged ever further apart to the point that they barely shared any code. And the HDD mode became almost as fast as the SSD mode at its best, although not quite.

Maintaining the two separate code paths was becoming too much of a burden though. Given the nature of a hobby open source project I can only dedicate so much (read: little) time to it so, sadly, the SSD mode is gone. On the positive side, this allowed me to remove a lot of code, which is nice.

dupd 1.6 released

I just tagged release 1.6 of dupd.

See the ChangeLog for a list of changes.

As usual, there is not much in the way of directly user-visible changes. It should continue to work as it did. If not, let me know.

One change worth pointing out is that the HDD mode is now the default (a new option, --ssd, allows selecting the SSD mode). I’m on SSDs myself and generally use the SSD mode, so why the change? Well, the HDD mode is a more conservative default: even though the SSD mode is often faster, it can also be dramatically slower in worst-case scenarios.

One downside of the HDD mode is that it uses more RAM. I haven’t seen it use excessive memory with any real-world file sets I have but in theory it could. If you run into this, there is a new option to limit buffer sizes (see man page).

Although not user-visible, this release does contain a significant rewrite of several subsystems (dir and path storage, thread work queues). So let me know if any bugs surface.

dupd: Introducing HDD mode

For most of its development, my duplicate detection utility dupd has been optimized for SSDs only. This wasn’t an intentional choice per se, just a side effect of the fact that the various machines I tend to test and develop on are all SSD based.

The 1.4 release introduces support for a new scan mode which works better on hard disk drives (HDDs). While this mode does have additional overhead (both CPU and RAM) compared to the default mode (which makes it generally slower if the data is on a SSD) it more than makes up for it by reducing the time spent waiting for I/O if the file data is scattered on spinning rust.

Here are some runs from a HDD-based machine I have. The file set consists of general data of all kinds from a subset of my home directory. There are 148,933 files with 44,339 duplicates.

The timings are the average of 5 runs, with the filesystem cache cleared (echo 3 > /proc/sys/vm/drop_caches) before each run (this is highly artificial, of course, as you’d never ever do that in real life, but interesting for testing a worst-case scenario).

[Figure: dupd_14_scan]

Here the --hdd mode is almost 12x faster (68 seconds vs. 813 seconds)!

It is important to note that if the file data being scanned is in the filesystem cache then you are better off using the default mode even if the underlying files are stored on a HDD. If you are cleaning duplicates “the dupd way” and the machine has enough RAM then it is more likely than not that most or all of the data will be in the cache in all runs except the first one.

My rule of thumb recommendation on a HDD-based machine is to always run the first scan using the --hdd mode and then try subsequent scans both with and without the --hdd mode to see which works best on your hardware and with that particular data set. As with all things performance, YMMV!

 

dupd vs. jdupes (take 2)

My previous comparison of dupd vs. jdupes prompted the author of jdupes to try it out on his system and write a similar comparison of dupd and jdupes from his perspective.

The TL;DR is you optimize what you measure. For the most part, jdupes is faster on Jody’s system/dataset and dupd is faster on mine. It’d be fun to do a deeper investigation on the details (and if I ever have some extra spare time I will) but for now I ran a few more rounds of tests to build on the two previous articles.

Methodology Notes

  • First and foremost, while it is fun to try to get optimal run times, I don’t want to focus on scenarios which are so optimized for performance that they are not realistic use cases for me. Thus:
    • SQLite overhead: Yes, dupd saves all the duplicate data into an sqlite database for future reference. This does add overhead and Jody’s article prompted me to try a few runs with the --nodb option (which causes dupd to skip creating the sqlite db and print to stdout instead, just like jdupes does). However, to me by far the most useful part of dupd is the interactive usage model it enables, which requires the sqlite database. So I won’t focus much on --nodb runs because I’d never run dupd that way and I want to focus on (my) real world usage.
    • File caches: This time I ran tests both with warmed up file caches (by repeating runs) and purged file caches (by explicitly clearing them prior to every run). For me, the warm file cache scenario is actually the one most closely matching real world usage because I tend to run a dupd scan, then work interactively on some subset of data, then run dupd scan again, repeat until tired. For someone whose workflow is to run a cold scan once and not re-scan until much later, the cold cache numbers will be more applicable.
  • I ran both dupd and jdupes with -q to eliminate informative output during the runs. It doesn’t make much difference in dupd but according to Jody this helps jdupes runtimes so I quieted both.
  • By default, dupd ignores hidden files and jdupes includes them. To make comparable runs, either use --hidden for dupd to include them or --nohidden for jdupes to exclude them. I decided to run with dupd --hidden.
  • The average time reported below from each set of runs is the average of all runs but excluding the slowest and fastest runs.
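The averaging rule in the last note (drop the fastest and the slowest run, then average the rest) can be sketched like this. The helper names `to_seconds` and `trimmed_average` are hypothetical; the input strings follow the "total" column that zsh's `time` prints in the raw data at the end of this article.

```python
def to_seconds(total):
    """Convert a zsh `time` total such as "9.273" or "1:42.54" to seconds."""
    seconds = 0.0
    for part in total.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def trimmed_average(times):
    """Average the runs after dropping one fastest and one slowest run."""
    if len(times) < 3:
        raise ValueError("need at least three runs")
    kept = sorted(times)[1:-1]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Totals from the "SSD: dupd, clear file cache" runs.
    totals = ["1:42.54", "1:43.41", "1:39.03", "1:40.69",
              "1:47.55", "1:44.18", "1:36.92"]
    print(f"{trimmed_average([to_seconds(t) for t in totals]):.2f}")  # 101.97
```

Dropping the extremes makes the reported figure less sensitive to a single run disturbed by unrelated activity on the machine.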

Results (SSD)

These runs are on a slow machine (Intel Atom S1260 2.00GHz) and SSD drive (Intel SSD 520) running Linux (debian) ext4.

Files scanned: 197,171
Total duplicates: 77,044

For my most common usage scenario, jdupes takes almost three times (~2.8x) longer to process the same file set. Running dupd with --nodb is marginally faster than the default dupd run (but I wouldn’t really run it that way because the sqlite db is too convenient).

[Figure: ssd-warm-cache]

Next I tried clearing the file cache before every run. Here the dupd advantage is reduced, but jdupes still took about 1.4x longer than dupd.

[Figure: ssd-cold-cache]

Results (Slow Disk)

Jody’s tests show jdupes being faster in many scenarios, so I’d like to find a way to reproduce that. The answer is slow disks. I have this old Mac Mini (2009) with an even slower disk which should do the trick. Let’s see. Fewer files, fewer duplicates, but the disk is so slow these runs take a while (so I only ran 5 repetitions instead of 7).

Files scanned: 62,347
Total duplicates: 13,982

Indeed, here jdupes has the advantage as dupd takes about 1.7x longer.

[Figure: rust-warm-cache]

A few notes on these Mac runs:

  • I did also run dupd with --nodb, but here it didn’t make any meaningful difference.
  • I also ran both dupd and jdupes with a cold file cache, or at least I tried to. I ran the purge(8) command prior to each run. It claims: “Purge can be used to approximate initial boot conditions with a cold disk buffer cache for performance analysis”. However, it made no difference at all in the measured times for either dupd or jdupes.

Conclusions

It seems that in terms of performance, you’ll do better with dupd if you’re on SSDs but if you’re on HDDs then jdupes can be faster. Ideally, try both and let us know!

Also, even though tuning and testing the performance is so much fun, ultimately usability matters even more. For me, the interactive workflow supported by dupd is what makes it special (but then, that’s why I wrote it so I’m biased ;-) and I couldn’t live without it.

Finally, thanks to Jody for fixing a bug in dupd that showed up only on XFS (which I don’t use so never noticed) and for prompting me to do a few additional enhancements.

 

Raw Data (commands and times)

SSD: dupd, normal usage

% repeat 7 time dupd scan -p $HOME -q --hidden
dupd scan -p $HOME -q --hidden  4.89s user 8.00s system 138% cpu 9.273 total
dupd scan -p $HOME -q --hidden  5.04s user 8.23s system 142% cpu 9.335 total
dupd scan -p $HOME -q --hidden  4.98s user 7.78s system 139% cpu 9.141 total
dupd scan -p $HOME -q --hidden  4.86s user 7.92s system 139% cpu 9.146 total
dupd scan -p $HOME -q --hidden  5.61s user 8.00s system 143% cpu 9.503 total
dupd scan -p $HOME -q --hidden  4.95s user 7.79s system 140% cpu 9.082 total
dupd scan -p $HOME -q --hidden  4.96s user 7.80s system 139% cpu 9.119 total

average = 9.20

SSD: dupd, clear file cache

% repeat 7 (sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"; time dupd scan -p $HOME -q --hidden)
dupd scan -p $HOME -q --hidden  11.86s user 43.55s system 54% cpu 1:42.54 total
dupd scan -p $HOME -q --hidden  12.42s user 44.47s system 55% cpu 1:43.41 total
dupd scan -p $HOME -q --hidden  12.12s user 43.65s system 56% cpu 1:39.03 total
dupd scan -p $HOME -q --hidden  12.22s user 43.37s system 55% cpu 1:40.69 total
dupd scan -p $HOME -q --hidden  12.60s user 45.28s system 53% cpu 1:47.55 total
dupd scan -p $HOME -q --hidden  12.10s user 44.51s system 54% cpu 1:44.18 total
dupd scan -p $HOME -q --hidden  12.43s user 43.74s system 57% cpu 1:36.92 total

average = 101.97

SSD: dupd, do not create database

% repeat 7 time dupd scan -p $HOME -q --hidden  --nodb > results
dupd scan -p $HOME -q --hidden --nodb > results  4.26s user 7.70s system 136% cpu 8.785 total
dupd scan -p $HOME -q --hidden --nodb > results  4.36s user 7.54s system 136% cpu 8.710 total
dupd scan -p $HOME -q --hidden --nodb > results  4.28s user 7.69s system 136% cpu 8.770 total
dupd scan -p $HOME -q --hidden --nodb > results  4.23s user 7.64s system 136% cpu 8.708 total
dupd scan -p $HOME -q --hidden --nodb > results  4.34s user 7.58s system 136% cpu 8.757 total
dupd scan -p $HOME -q --hidden --nodb > results  4.19s user 7.66s system 135% cpu 8.736 total
dupd scan -p $HOME -q --hidden --nodb > results  4.58s user 7.75s system 140% cpu 8.772 total

average = 8.75

SSD: dupd, clear file cache and do not create database

repeat 7 (sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"; time dupd scan -p $HOME -q --hidden  --nodb > results)
dupd scan -p $HOME -q --hidden --nodb > results  9.67s user 36.51s system 51% cpu 1:29.39 total
dupd scan -p $HOME -q --hidden --nodb > results  10.92s user 43.76s system 53% cpu 1:41.45 total
dupd scan -p $HOME -q --hidden --nodb > results  10.79s user 43.58s system 54% cpu 1:38.93 total
dupd scan -p $HOME -q --hidden --nodb > results  10.62s user 43.59s system 56% cpu 1:35.45 total
dupd scan -p $HOME -q --hidden --nodb > results  10.76s user 44.39s system 54% cpu 1:41.45 total
dupd scan -p $HOME -q --hidden --nodb > results  10.92s user 43.78s system 55% cpu 1:38.87 total
dupd scan -p $HOME -q --hidden --nodb > results  10.72s user 43.07s system 53% cpu 1:41.50 total

average = 99.23

SSD: jdupes, warm file cache

repeat 7 time jdupes -q -r $HOME > results
jdupes -q -r $HOME > results  10.54s user 14.91s system 99% cpu 25.626 total
jdupes -q -r $HOME > results  10.76s user 14.66s system 99% cpu 25.587 total
jdupes -q -r $HOME > results  10.68s user 14.86s system 99% cpu 25.725 total
jdupes -q -r $HOME > results  10.76s user 14.67s system 99% cpu 25.614 total
jdupes -q -r $HOME > results  10.62s user 14.76s system 99% cpu 25.549 total
jdupes -q -r $HOME > results  10.75s user 14.87s system 99% cpu 25.801 total
jdupes -q -r $HOME > results  10.48s user 14.87s system 99% cpu 25.527 total

average = 25.62

SSD: jdupes, clear file cache

repeat 7 (sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"; time jdupes -q -r $HOME  > results)
jdupes -q -r $HOME > results  26.11s user 72.99s system 67% cpu 2:25.89 total
jdupes -q -r $HOME > results  26.54s user 71.06s system 68% cpu 2:22.01 total
jdupes -q -r $HOME > results  24.64s user 72.62s system 66% cpu 2:26.57 total
jdupes -q -r $HOME > results  26.01s user 70.05s system 68% cpu 2:20.15 total
jdupes -q -r $HOME > results  26.25s user 72.48s system 67% cpu 2:26.67 total
jdupes -q -r $HOME > results  24.63s user 70.70s system 67% cpu 2:20.77 total
jdupes -q -r $HOME > results  25.41s user 72.40s system 68% cpu 2:23.80 total

average = 143.81

Slow Disk: dupd, normal usage

% repeat 5 time dupd scan -p $HOME --hidden -q
dupd scan -p $HOME --hidden -q  4.62s user 29.72s system 2% cpu 21:05.70 total
dupd scan -p $HOME --hidden -q  4.38s user 29.88s system 2% cpu 22:34.14 total
dupd scan -p $HOME --hidden -q  4.78s user 30.09s system 2% cpu 21:29.52 total
dupd scan -p $HOME --hidden -q  4.37s user 29.07s system 2% cpu 21:18.10 total
dupd scan -p $HOME --hidden -q  4.39s user 29.24s system 2% cpu 21:11.19 total

average = 1279.60

Slow Disk: jdupes, warm file cache

% repeat 5 time jdupes -q -r $HOME > results
jdupes -q -r $HOME > results  8.88s user 31.70s system 5% cpu 13:05.37 total
jdupes -q -r $HOME > results  8.87s user 30.51s system 5% cpu 12:41.61 total
jdupes -q -r $HOME > results  8.80s user 30.56s system 4% cpu 13:30.56 total
jdupes -q -r $HOME > results  8.85s user 30.62s system 5% cpu 12:34.43 total
jdupes -q -r $HOME > results  8.80s user 30.18s system 5% cpu 12:32.14 total

average = 767.14

dupd vs. jdupes

Tonight I ran across jdupes which I had not seen before. It is a fork of the venerable fdupes with quite a few performance improvements. Performance?! Well I had to try it of course. Here are a few runs of jdupes and dupd on my home directory for comparison (using -A to skip hidden files which is the default in dupd):

% repeat 5 time ./jdupes -r $HOME -A  > out
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.06s user 10.96s system 99% cpu 14.156 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.07s user 10.82s system 99% cpu 14.029 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.00s user 11.01s system 99% cpu 14.156 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.22s user 11.03s system 98% cpu 14.414 total
Examining 164413 files, 22108 dirs (in 1 specified)
./jdupes -r $HOME -A > out  3.04s user 10.87s system 99% cpu 14.042 total

So, consistently about 14 seconds.

% repeat 5 time dupd scan -p $HOME -q
dupd scan -p $HOME -q  3.99s user 6.76s system 139% cpu 7.691 total
dupd scan -p $HOME -q  4.16s user 6.28s system 140% cpu 7.416 total
dupd scan -p $HOME -q  4.13s user 6.53s system 141% cpu 7.540 total
dupd scan -p $HOME -q  3.98s user 6.39s system 139% cpu 7.405 total
dupd scan -p $HOME -q  4.00s user 6.44s system 140% cpu 7.404 total

About 7.5 seconds, or just under that, for dupd.

I still have a handful of ideas to make dupd faster. As I find some spare time I’ll try them out.

dupd 1.2 released

I just tagged the release of dupd 1.2, enjoy hunting those duplicates!

This time I included pre-built binaries for a few platforms. Probably mostly useful on OS X for those without dev tools installed.

Some dupd performance improvements

Performance Improvements in dupd 1.2

Recently I’ve done a few performance improvements to dupd, motivated by one particular edge case file set I was working with a while back. That file set had very large numbers (over 100K) of files of the same size (these were log files from a production system where the content was always different but due to the structure of the files they tended to have the same size). This was a worst case scenario for dupd given the way it grouped files of the same size as potential duplicates. With the latest changes (in dupd 1.2) this scenario is dramatically faster (scan time reduced from about an hour to about five minutes – see below).

In more common scenarios these improvements don’t make a big difference but there is still some small benefit. Memory consumption is also reduced in dupd 1.2 (there is more room to reduce memory consumption that I might play with if I have time some day).

In a nutshell, dupd 1.2 should be either no slower, slightly faster or in some edge cases dramatically faster than dupd 1.1.

The edge case: lots of files of the same size

[Figure: dupd_samesizes]

With dupd 1.1 scan time was 59m57s, which is what motivated me to improve it for that file set. Now with dupd 1.2, scan time for the same file set is only 4m57s! Mission accomplished.

The three main changes were:

  • ptr2end (Reduced time from 59m57s to 26m57s) – Simply store a pointer to the end of the size list instead of walking it. Normally the size lists are tiny, on average I see well under 10 elements. But when it grew to over 100K elements this made a huge difference.
  • local_memcmp (Reduced time from 26m57s to 20m36s) – Instead of using memcmp(3) always, use a local implementation when the buffers being compared are small. This made a surprising amount of difference.
  • hashlist_ptr (Reduced time from 20m36s to 4m57s) – As dupd processes file sets from the sizelist to the hashlists, it was copying the paths. Now, just copy pointers. This skips a lot of unnecessary strcpy(3)ing as well as reduces memory consumption.
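The idea behind the first change (ptr2end) can be illustrated with a small linked list that remembers its last node. This is not dupd's code (dupd is written in C); it is a Python sketch of the general technique, and the class names `Node` and `SizeList` are hypothetical.

```python
class Node:
    """One entry in a list of paths that share the same file size."""

    __slots__ = ("path", "next")

    def __init__(self, path):
        self.path = path
        self.next = None


class SizeList:
    """Singly linked list that keeps a pointer to its last node."""

    def __init__(self):
        self.head = None
        self.tail = None  # pointer to the end, so append never walks the list

    def append(self, path):
        """Append in constant time by linking after the stored tail."""
        node = Node(path)
        if self.tail is None:
            self.head = node       # first element
        else:
            self.tail.next = node  # no traversal from head needed
        self.tail = node

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.path
            node = node.next


if __name__ == "__main__":
    same_size = SizeList()
    for name in ("log.1", "log.2", "log.3"):
        same_size.append(name)
    print(list(same_size))
```

Without the tail pointer, each append has to walk from the head to find the end, so adding n entries costs on the order of n² steps. That is negligible for lists of under 10 elements but very noticeable once a list grows past 100K.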

Normal case: smaller set of files with no odd size distributions

That said, do these changes translate to any benefit on more “normal” file sets? Nowhere near as dramatically, but it’s still faster and uses less memory so that’s all good.

[Figure: dupd_home]

These scans are from my $HOME dir on one machine. Scan time was reduced from 10.6s (average of 5 runs) to 8.1s, an improvement of about 23%, not bad at all.

No change: spinning rust

All the numbers above are from machines with SSDs. I also tested on a couple machines with traditional hard drives and there was zero change in performance. No graph, it’s just a straight line ;-)

With normal hard drives, the file I/O time so completely dominates run time that there is no difference from any dupd improvements.

(I suspect the edge case file set would have seen improvement even on spinning rust, but I didn’t have the chance to test that scenario.)