Duplicate finder performance (2018 edition)

It has been over three years since I did a performance comparison of a group of duplicate file finder utilities. Having recently released dupd 1.7, I finally took some time to refresh my benchmarks.

The Contenders

This time around I tested the following utilities: dupd, duff, fdupes, fslint, jdupes, rdfind and rmlint.

I had also intended to test the following, but these crashed during the runs so I didn’t get any numbers: dups, fastdup, identix, py_duplicates.

Based on these and the previous results, I think going forward I’ll only test these: dupd, jdupes, rmlint and rdfind.

If anyone knows of another duplicate finder with solid performance worth comparing, please let me know!

The Files

The file set has a total of 154205 files. Of these, 18677 have a unique size and 1216 were otherwise ignored (zero-sized or not regular files). This leaves 134312 files for further processing. Of these, there are 44926 duplicates in 13828 groups (and thus, 89386 unique files).

The files are all “real” files. That is, they are all taken from my home file server rather than artificially constructed for the benchmark. There is a mix of all types of files such as source code, documents, images, videos and other miscellaneous stuff that accumulates on the file server.

In the past I’ve generally focused on testing on SSD media only, as that’s what I use myself. To be more thorough, this time I installed an HDD in the same machine and duplicated the exact same set of files on both devices.

The cache

Of course, when a process reads file content it doesn’t necessarily trigger a read from the underlying device, be it SSD or HDD, because the content may already be in the file cache (and often is).

This time I ran each utility/media combination in two ways: once with the file cache cleared prior to every run, and once with the cache left undisturbed from run to run.

In my experience, the warm cache runs are more representative of real world usage because when I’m working on duplicates I run the tool many times as I clean up files. For the sake of more thorough results, I’ve reported both scenarios.

The methodology

For each tool/media (SSD and HDD) combination, the runs were done as follows:

  1. Clear the filesystem cache (echo 3 > /proc/sys/vm/drop_caches).
  2. Run the scan once, discarding the result.
  3. Repeat 5 times:
    1. For the no-cache runs, clear the cache again.
    2. Run and time the tool.
  4. Report the average of the above five runs as the result.

The command lines and individual run times are included at the bottom of this article.
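
In pseudo-shell, each tool/media combination was measured roughly like this (a sketch, not the exact script I used; TOOL_CMD is just a stand-in for one of the command lines listed in the raw data):

#!/bin/sh
# Illustrative benchmark loop: one warm-up run, then five timed runs, then the average.
TOOL_CMD="dupd scan -q -p /hdd/files"    # example; substitute each tool's command line
CLEAR_CACHE=1                            # set to 0 for the warm-cache runs

echo 3 > /proc/sys/vm/drop_caches        # start from a cold cache (needs root)
$TOOL_CMD > /dev/null                    # untimed warm-up run, result discarded

total=0
for i in 1 2 3 4 5; do
  if [ "$CLEAR_CACHE" = "1" ]; then
    echo 3 > /proc/sys/vm/drop_caches    # cold-cache runs: clear again before each run
  fi
  start=$(date +%s.%N)
  $TOOL_CMD > /dev/null
  end=$(date +%s.%N)
  total=$(echo "$total + $end - $start" | bc)
done
echo "average: $(echo "$total / 5" | bc -l)"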

Results

1. HDD with cache

[chart: HDD with cache]

2. HDD without cache

[chart: HDD without cache]

3. SSD with cache

[chart: SSD with cache]

4. SSD without cache

[chart: SSD without cache]

Summary

As you can see above, the ranking varies depending on each scenario. However, I’m happy to see dupd is the fastest in three of four scenarios and a very close second in the fourth.

To conclude with some kind of ranking, let’s look at the average finishing position of each tool:

Tool     Average rank   Rank in each scenario
dupd     1.3            1, 1, 1, 2
rmlint   3.0            2, 7, 1, 2
jdupes   3.8            3, 2, 5, 5
rdfind   3.8            5, 4, 3, 3
duff     4.5            4, 3, 4, 7
fdupes   5.8            6, 5, 6, 6
fslint   6.0            7, 6, 7, 4

The Raw Data

-----[ rmlint : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 92.07
Running 5 times (timeout=3600): rmlint -o fdupes /hdd/files
Run 0 took 27.83
Run 1 took 27.75
Run 2 took 27.59
Run 3 took 27.75
Run 4 took 27.66
AVERAGE TIME:
27.716


-----[ jdupes : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 268.6
Running 5 times (timeout=3600): jdupes -A -H -r -q /hdd/files
Run 0 took 5.96
Run 1 took 5.94
Run 2 took 5.96
Run 3 took 5.93
Run 4 took 5.99
AVERAGE TIME:
5.956


-----[ dupd-hdd : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 46.78
Running 5 times (timeout=3600): dupd scan -q -p /hdd/files
Run 0 took 3.24
Run 1 took 3.29
Run 2 took 3.21
Run 3 took 3.21
Run 4 took 3.25
AVERAGE TIME:
3.24


-----[ rdfind : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 90.25
Running 5 times (timeout=3600): rdfind -n true /hdd/files
Run 0 took 8.52
Run 1 took 8.49
Run 2 took 8.53
Run 3 took 8.48
Run 4 took 8.41
AVERAGE TIME:
8.486


-----[ fslint : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 103.64
Running 5 times (timeout=3600): findup /hdd/files
Run 0 took 20.38
Run 1 took 20.4
Run 2 took 20.36
Run 3 took 20.36
Run 4 took 20.39
AVERAGE TIME:
20.378


-----[ fdupes : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 278.11
Running 5 times (timeout=3600): fdupes -A -H -r -q /hdd/files
Run 0 took 15.76
Run 1 took 15.78
Run 2 took 15.72
Run 3 took 15.74
Run 4 took 15.88
AVERAGE TIME:
15.776


-----[ duff : CACHE KEPT : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 935.51
Running 5 times (timeout=3600): duff -r -z /hdd/files
Run 0 took 7.03
Run 1 took 7.01
Run 2 took 6.98
Run 3 took 6.99
Run 4 took 6.99
AVERAGE TIME:
7


-----[ rmlint : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 90.59
Running 5 times (timeout=3600): rmlint -o fdupes /hdd/files
Run 0 took 89.86
Run 1 took 89.4
Run 2 took 90.44
Run 3 took 89.87
Run 4 took 90.84
AVERAGE TIME:
90.082


-----[ jdupes : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 269.69
Running 5 times (timeout=3600): jdupes -A -H -r -q /hdd/files
Run 0 took 268.97
Run 1 took 270.07
Run 2 took 268.52
Run 3 took 268.95
Run 4 took 269
AVERAGE TIME:
269.102


-----[ dupd-hdd : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 46.26
Running 5 times (timeout=3600): dupd scan -q -p /hdd/files
Run 0 took 46.37
Run 1 took 46.43
Run 2 took 46.24
Run 3 took 46.68
Run 4 took 46.62
AVERAGE TIME:
46.468


-----[ rdfind : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 86.59
Running 5 times (timeout=3600): rdfind -n true /hdd/files
Run 0 took 86.48
Run 1 took 87.02
Run 2 took 86.55
Run 3 took 86.57
Run 4 took 86.75
AVERAGE TIME:
86.674


-----[ fslint : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 103.29
Running 5 times (timeout=3600): findup /hdd/files
Run 0 took 103.49
Run 1 took 103.64
Run 2 took 102.97
Run 3 took 103.16
Run 4 took 103.28
AVERAGE TIME:
103.308


-----[ fdupes : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 276.02
Running 5 times (timeout=3600): fdupes -A -H -r -q /hdd/files
Run 0 took 276.88
Run 1 took 276.18
Run 2 took 276.83
Run 3 took 277.87
Run 4 took 276.99
AVERAGE TIME:
276.95


-----[ duff : CACHE CLEARED EACH RUN : /hdd/files]------
Running one untimed scan first...
Result/time from untimed run: 935.56
Running 5 times (timeout=3600): duff -r -z /hdd/files
Run 0 took 936.06
Run 1 took 936.87
Run 2 took 936.58
Run 3 took 937.01
Run 4 took 935.95
AVERAGE TIME:
936.494


-----[ rmlint : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 18.62
Running 5 times (timeout=3600): rmlint -o fdupes /ssd/files
Run 0 took 6.38
Run 1 took 6.33
Run 2 took 6.3
Run 3 took 6.32
Run 4 took 6.32
AVERAGE TIME:
6.33


-----[ jdupes : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 35.12
Running 5 times (timeout=3600): jdupes -A -H -r -q /ssd/files
Run 0 took 6.89
Run 1 took 6.84
Run 2 took 6.88
Run 3 took 6.83
Run 4 took 6.91
AVERAGE TIME:
6.87


-----[ dupd-hdd : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 19.85
Running 5 times (timeout=3600): dupd scan -q -p /ssd/files
Run 0 took 3.34
Run 1 took 3.17
Run 2 took 3.25
Run 3 took 3.3
Run 4 took 3.29
AVERAGE TIME:
3.27


-----[ rdfind : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 31.43
Running 5 times (timeout=3600): rdfind -n true /ssd/files
Run 0 took 8.5
Run 1 took 8.38
Run 2 took 8.42
Run 3 took 8.39
Run 4 took 8.38
AVERAGE TIME:
8.414


-----[ fslint : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 44.67
Running 5 times (timeout=3600): findup /ssd/files
Run 0 took 20.63
Run 1 took 20.58
Run 2 took 20.54
Run 3 took 20.54
Run 4 took 20.53
AVERAGE TIME:
20.564


-----[ fdupes : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 42.13
Running 5 times (timeout=3600): fdupes -A -H -r -q /ssd/files
Run 0 took 15.68
Run 1 took 15.52
Run 2 took 15.53
Run 3 took 15.56
Run 4 took 15.54
AVERAGE TIME:
15.566


-----[ duff : CACHE KEPT : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 32.54
Running 5 times (timeout=3600): duff -r -z /ssd/files
Run 0 took 7
Run 1 took 6.96
Run 2 took 6.98
Run 3 took 6.95
Run 4 took 6.95
AVERAGE TIME:
6.968


-----[ rmlint : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 17.39
Running 5 times (timeout=3600): rmlint -o fdupes /ssd/files
Run 0 took 17.29
Run 1 took 17.21
Run 2 took 17.25
Run 3 took 17.24
Run 4 took 17.31
AVERAGE TIME:
17.26


-----[ jdupes : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 34.36
Running 5 times (timeout=3600): jdupes -A -H -r -q /ssd/files
Run 0 took 34.3
Run 1 took 34.35
Run 2 took 34.48
Run 3 took 34.34
Run 4 took 34.36
AVERAGE TIME:
34.366


-----[ dupd-hdd : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 19.7
Running 5 times (timeout=3600): dupd scan -q -p /ssd/files
Run 0 took 19.67
Run 1 took 19.65
Run 2 took 19.66
Run 3 took 19.65
Run 4 took 19.51
AVERAGE TIME:
19.628


-----[ rdfind : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 30.93
Running 5 times (timeout=3600): rdfind -n true /ssd/files
Run 0 took 30.7
Run 1 took 30.61
Run 2 took 30.72
Run 3 took 30.8
Run 4 took 30.79
AVERAGE TIME:
30.724


-----[ fslint : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 44.41
Running 5 times (timeout=3600): findup /ssd/files
Run 0 took 44.23
Run 1 took 44.3
Run 2 took 44.44
Run 3 took 44.24
Run 4 took 44.41
AVERAGE TIME:
44.324


-----[ fdupes : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 42.05
Running 5 times (timeout=3600): fdupes -A -H -r -q /ssd/files
Run 0 took 41.79
Run 1 took 41.79
Run 2 took 41.79
Run 3 took 41.8
Run 4 took 41.92
AVERAGE TIME:
41.818


-----[ duff : CACHE CLEARED EACH RUN : /ssd/files]------
Running one untimed scan first...
Result/time from untimed run: 32.45
Running 5 times (timeout=3600): duff -r -z /ssd/files
Run 0 took 32.49
Run 1 took 32.48
Run 2 took 32.52
Run 3 took 32.49
Run 4 took 32.48
AVERAGE TIME:
32.492


dupd 1.7 released

I just tagged release 1.7 of dupd.

See the ChangeLog for a list of changes.

The major change is that the SSD mode has been removed. That’s a bit sad because, as I’ve written over the years, the SSD mode is/was faster in some scenarios when reading off an SSD. However, as I’ve also mentioned, the drawback was that the SSD mode could be vastly slower in the “wrong” circumstances.

As of version 1.6 I had intended to keep both so the best one could be used in each situation. During the 1.7 development I ended up doing internal code refactoring on the HDD mode to make it simpler and faster. The end result was that the two implementations diverged further and further, to the point that they barely shared any code. And the HDD mode became almost as fast as the SSD mode at its best, although not quite.

Maintaining the two separate code paths was becoming too much of a burden though. Given the nature of a hobby open source project I can only dedicate so much (read: little) time to it so, sadly, the SSD mode is gone. On the positive side, this allowed me to remove a lot of code, which is nice.

dupd 1.6 released

I just tagged release 1.6 of dupd.

See the ChangeLog for a list of changes.

As usual, there is not much in the way of directly user-visible changes. It should continue to work as it did. If not, let me know.

One change worth pointing out is that the HDD mode is now the default (a new option, --ssd, allows selecting the SSD mode). I’m on SSDs myself and generally use the SSD mode, so why the change? Well, the HDD mode is a more conservative default because even though the SSD mode is often faster, it is also true that the SSD mode can be dramatically slower in worst-case scenarios.
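
As a quick example (a sketch only; the path is just illustrative), selecting the old behavior now looks like this:

dupd scan -p /data           # as of 1.6, this uses the HDD mode by default
dupd scan --ssd -p /data     # explicitly selects the old SSD mode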

One downside of the HDD mode is that it uses more RAM. I haven’t seen it use excessive memory with any real-world file sets I have but in theory it could. If you run into this, there is a new option to limit buffer sizes (see man page).

Although not user-visible, this release does contain a significant rewrite of several subsystems (dir and path storage, thread work queues). So let me know if any bugs surface.

2018 January Trip to Monterey

Started the year off right with a family sail down to Monterey and back on our Catalina 270.

Day 1

Unfortunately, there was no wind at all today so we had to motor all the way. Glassy seas and a nice day though.

My son did a great job steering to the compass for a big part of the trip.

Abundant sea life on this trip. First we saw several whales which surfaced many times.

Later we saw a large group of dolphins cross our bow and head off into the distance. Based on the white stripe on the fin, I’m guessing they were Dall’s Porpoises.

Later, approaching Monterey, I suddenly noticed the water was filled with jellies!

Soon after, we were tied off at the Breakwater Cove Marina in Monterey.

After a nice dinner at Gianni’s Pizza, it was back to the boat for the evening.

Day 2

Rain in the forecast for today, sadly. On the plus side, at least some wind.

Heading north from Monterey there was indeed a bit of wind, so we finally got to do some proper sailing. The fun only lasted about 4 nautical miles though, after which the wind became light and speed dropped below 3 knots. So I motorsailed the rest of the day to get home sooner, given the cold and steady rain.

We did see two whales which surfaced quite close to the boat. Given the rain, I didn’t take the camera out today, so no photos. Also, the GPS managed to reset itself somehow, so it didn’t record the whole track.


Sailing from Alameda to Santa Cruz

It was finally time this weekend to move my Catalina 270 from Alameda to my home port of Santa Cruz. We had a fun summer of sailing and weekend trips (Pier 39, Angel Island) in San Francisco Bay but I wanted to bring the boat home before winter weather (storms and bigger swell) arrives.

Joining me for the trip was Joe C. We’d raced together in the Club Nautique Bay Buoy Boogie event this summer. The trip to Santa Cruz can be done in one long day but it is easier to break it down into two parts, so I planned a stop at Half Moon Bay.

Day 1

NOAA Tidal Current Tables
San Francisco Bay Entrance (Golden Gate), Calif., 2017

#      Slack  Max Curr  Slack  Max Curr  Slack  Max Curr  Slack
----------------------------------------------------------------
29 Fr  0419  0700 1.6F  0954  1239 1.7E  1606  1752 0.8F  1936

National Weather Service Marine Forecast FZUS56 KMTR
Point Reyes To Pigeon Point To 10 nm-

FRI: NW winds 5 to 15 kt...
    increasing to 15 to 25 kt in the afternoon.
    Wind waves 3 to 5 ft. W swell 3 ft at 10 seconds.

Forecast calls for nearly flat seas and a good breeze, so it seemed (and was) a perfect day for the trip south to Half Moon Bay.

We cast off from Marina Village at 7:46am.

Seven minutes later, the diesel sputtered and stalled! It might’ve been an early end to the trip but fortunately, after waiting a few moments, the engine restarted easily and ran strong for the rest of the trip. The way it stalled felt like an air bubble in the diesel line but unlike my old Catalina 27, the fuel system on the 270 is self-bleeding. In any case, the engine ran perfectly for about 10 hours after that so no further problems.

We motored out to the Golden Gate Bridge, staying closer to the city front to avoid the trailing end of the flood. We crossed under the Golden Gate at 9:52am, right before slack water (9:54am). It would’ve been nice to time it to get a push from the ebb but this would’ve moved our arrival in Half Moon Bay a bit too late, so slack was good enough for today.

We kept motoring west just south of the shipping channel until the R”2″ buoy. This is a bit of a detour but it avoids the south bar. On a flat day like today we could’ve turned south much earlier but out of habit and prudence I prefer to turn left near R”2″.

Several pods of dolphins popped in and out in the vicinity of the boat but none stayed for long.

Once we reached R”2″ it was finally time to get the main up and go sailing! The reef line had managed to shake itself into a knot around the main so that took some resolving, but after that the engine was off and we were sailing by about 11:30am.

Sea was flat and breeze was good, both right on forecast. One long starboard tack past the R”26″ buoy (off Pt. Montara and the Colorado Reef) and down to the RW “PP” buoy marking the entrance to Pillar Point Harbor.

Pillar Point Harbor is perhaps best known (at least to non-boaters) as the location of the Mavericks surf break, a result of the reefs which surround this area. So it’s wise to stay in the correct channel.

[chart: Pillar Point Harbor entrance]

At 4pm we dropped the main after 4h:30m of sailing and by 4:25pm we were tied up at a guest slip in the harbor. Pillar Point is a busy commercial harbor so we had rather large neighbors.

For dinner and drinks we stopped by the Half Moon Bay Yacht Club, a fun spot, highly recommended.

Summary:

  • Departure: 7:46am
  • Arrival: 4:25pm
  • Distance: 42.8 nautical miles
  • Total time: 8h:39m
  • Motoring: about 4h

[GPS track: Day 1]

Day 2

National Weather Service Marine Forecast FZUS56 KMTR
Point Reyes To Pigeon Point To 10 nm-

FRI NIGHT: NW winds 15 to 25 kt. Wind waves 3 to 5 ft. NW swell 4 to 6 ft at 10 seconds.
SAT: NW winds 20 to 30 kt. Wind waves 4 to 6 ft. NW swell 6 to 8 ft at 10 seconds.

According to the forecast the wind was going to blow all night and increase to as much as 30 kts during the day. The trip to Santa Cruz is down wind and with the swells on the stern so I wasn’t worried about the wind but still, was hoping for closer to 20 than 30. As it turned out, this wind forecast was quite a bit off from reality.

We cast off at 7:11am and motored south until past the G “1s” buoy and then hoisted the main at 7:55am. There was very little wind though (where was the promised 20-30 kts?) so we kept motoring for a long time. The flapping of the main became annoying after not too long, so I rigged a preventer and we kept it on for the rest of the trip. Much better!

The swells were perfect and only got better through the day. First on the starboard quarter and then right behind us as we turned near Pigeon Point, we spent all day surfing down the coast.

Between Half Moon Bay and Pescadero we saw a number of whales traveling in many groups. They all seemed to be congregated in this area as we didn’t see any further south (or north).

I don’t have the exact time but I believe we were able to shut down the engine right around noon. The wind had picked up enough that by then we managed to keep the average speed above 5 knots without the engine. We were under main alone; since we were going just about dead downwind there was no use for the jib today (I don’t have a whisker pole).

A bit further south we sailed by several Mola Molas (sunfish) sunning themselves near the surface. Always interesting to see these!

The ride down from Pigeon Point to Santa Cruz was ideal sailing, as good as it gets! Surfing down the 6-8ft swells running near hull speed for hours! Hull speed for a boat with a 24ft waterline length is about 6.5 knots (by the usual 1.34 × √LWL rule of thumb, 1.34 × √24 ≈ 6.6 kts) and we were averaging 6.5-6.7 kts in this time window:

[chart: speed over ground, Day 2]

Surfing down the swells we’d regularly hit 8 and even 9 knots. The very largest one had us hitting an astonishing 12.7 knots (SOG)!

Here are a couple of photos of the swell at our stern, about to lift the boat.

This is how much fun it was all afternoon.

I rarely take much video because I find that it never captures the essence of the ride, whether mountain biking or sailing, so I’m always disappointed by the result. I took a bit of video on this day hoping to capture how fun this rollercoaster ride was but as always, it falls flat. Still, here is a link to it:

https://www.youtube.com/watch?v=AvJ41kaIcAk

Right around the Wilder Ranch area the swell shut down as if by a switch and shortly afterwards the wind was gone as well. This was around 3:20pm, so we motored the rest of the way into Santa Cruz. Tied up at the dock at 4:30pm.

Summary:

  • Departure: 7:11am
  • Arrival: 4:30pm
  • Distance: 48.3 nautical miles
  • Total time: 9h:19m
  • Motoring: about 6h

[GPS track: Day 2]


This was the first longer trip with substantial motoring I’ve done with this boat, so I didn’t yet have a good sense of fuel consumption. Here’s what I observed:

  • At 2000rpm average speed is right around 5 knots.
  • At 2500rpm the average is a bit higher but not by much. Probably not worth it unless every fraction of a knot counts.
  • On this trip we motored just about 10 hours. Afterwards, I filled the tank with about 3 gallons so consumption was about 0.3 gal/hr.

Total distance 91 nautical miles from Marina Village Yacht Harbor to Santa Cruz Yacht Harbor.


Long weekend at Angel Island

Over the Labor Day weekend we had a great four day trip to Angel Island. In order to beat the rush, we sailed over from Alameda on Thursday to grab a spot. Fortunately there were still plenty of mooring spots available on Thursday afternoon, so we grabbed one relatively close to the beach.

I brought some floating towing line (Samson Ultra Blue) which, combined with the ‘Happy Hooker’ mooring hook, made it quick and easy to tie up.

Preparing for four days, we brought all the toys! The BBQ of course, plus both an inflatable kayak and an inflatable SUP. We even brought my son’s bicycle in order to bike around the island. (Our bikes were too big to bring, so we had to rent adult bikes on the island.)

Water in the cove was flat all weekend; it was absolutely perfect for kayaking and SUP’ping.

This was the weekend of a record-breaking heat wave in the Bay Area and it was over 100F at Angel Island. This made the cold water invitingly refreshing! And so we spent a lot of time swimming between the boat and the beach. Also from boat to boat, as my son made friends with the kids on the next boat over. It was surely a miserably hot weekend in most parts of the Bay Area but it made for a perfect way to enjoy the water!

Here’s a pic of our boat as I’m approaching on the kayak.

On the first morning, my son stood up in his viewing station (the V berth window) and announced there was a boat upside down just outside.

And it turns out, there really was! Later I read in Latitude 38 that it had flipped while racing and they left it overnight at the cove. A bit later a boat came and towed it away.

And that’s why I prefer monohulls.

The bike ride around the island was fun, although hot on that weekend.

Finally a nice sunset photo, because every travel story needs a nice sunset photo.

This was the first longer (more than two days) trip on our C270 and was great fun. The only downside is we ran out of water in the tank on day two (of four) even though we were trying everything possible to conserve. The C270 water tank is only 13 gallons, so not much there. I’ve been thinking of adding a secondary tank under the V berth, as there is some empty space there.

dupd: Introducing HDD mode

For most of its development, my duplicate detection utility dupd has been optimized for SSDs only. This wasn’t an intentional choice per se, just a side effect of the fact that the various machines I tend to test and develop on are all SSD based.

The 1.4 release introduces support for a new scan mode which works better on hard disk drives (HDDs). While this mode does have additional overhead (both CPU and RAM) compared to the default mode (which makes it generally slower if the data is on a SSD) it more than makes up for it by reducing the time spent waiting for I/O if the file data is scattered on spinning rust.

Here are some runs from a HDD-based machine I have. The file set consists of general data of all kinds from a subset of my home directory. There are 148,933 files with 44,339 duplicates.

The timings are the average of 5 runs, with the filesystem cache cleared (echo 3 > /proc/sys/vm/drop_caches) before each run (this is highly artificial, of course, as you’d never ever do that in real life, but interesting for testing a worst-case scenario).

[chart: dupd 1.4 scan times, default vs --hdd mode]

Here the --hdd mode is almost 12x faster (68 seconds vs. 813 seconds)!

It is important to note that if the file data being scanned is in the filesystem cache then you are better off using the default mode even if the underlying files are stored on a HDD. If you are cleaning duplicates “the dupd way” and the machine has enough RAM then it is more likely than not that most or all of the data will be in the cache in all runs except the first one.

My rule of thumb recommendation on a HDD-based machine is to always run the first scan using the --hdd mode and then try subsequent scans both with and without the --hdd mode to see which works best on your hardware and with that particular data set. As with all things performance, YMMV!
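
In command form, that rule of thumb looks something like this (a sketch only; the path is just an example):

# First, cold scan on a HDD-based machine: use the HDD mode
dupd scan --hdd -p /data

# Later scans: time both modes and keep whichever is faster for this data set
time dupd scan --hdd -p /data
time dupd scan -p /data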


Kids Bikes Weights

As my son grows, so do his bikes!

First bike: Strider

The Strider balance bike was the first one, at a year and a half old. No pedals, no brakes. Very light at 6.48 lbs. It was a great first bike to learn how to coast and balance without any pedals getting in the way. Even after he started riding the next bike, he kept using the balance bike for the local pump track. He rode this one until almost five years old!

Second bike: 12″ Specialized HotRock

When it was time to learn pedaling, we moved up to the 12″ HotRock when he was a bit over 3.5 years old. A big jump in weight to 15.6 lbs. And pedals! This bike comes with training wheels but we never used them since he already knew how to balance from the previous bike.

It only has a coaster brake (pedal backwards) unfortunately. I got some parts online and added a hand brake for the front wheel. It didn’t have much braking power but it served its purpose of getting him used to regular brakes (when he test rode the next bike which has only hand brakes, he was immediately familiar with using them).

Third bike: Cleary Hedgehog 16″

This is one nice bike, a huge step up in quality from the previous ones. This is a kids bike in size only; I’d love to ride it myself if I could fit! He’d been asking for big boy brakes (hand brakes) for a while and the Cleary, which arrived for his fifth birthday, delivers.

(In comparison, the 16″ Specialized HotRock inexplicably still has a coaster brake and training wheels. Specialized has completely lost the plot in this age segment!)

The Hedgehog is also fairly light, which is very nice as he is starting to ride some of the hilly mountain bike trails nearby. Even though it is a much larger bike than the 12″ one, it is barely over one pound heavier, at 16.8 lbs.



dupd vs. jdupes (take 2)

My previous comparison of dupd vs. jdupes prompted the author of jdupes to try it out on his system and write a similar comparison of dupd and jdupes from his perspective.

The TL;DR is you optimize what you measure. For the most part, jdupes is faster on Jody’s system/dataset and dupd is faster on mine. It’d be fun to do a deeper investigation on the details (and if I ever have some extra spare time I will) but for now I ran a few more rounds of tests to build on the two previous articles.

Methodology Notes

  • First and foremost, while it is fun to try to get optimal run times, I don’t want to focus on scenarios which are so optimized for performance that they are not realistic use cases for me. Thus:
    • SQLite overhead: Yes, dupd saves all the duplicate data into an sqlite database for future reference. This does add overhead and Jody’s article prompted me to try a few runs with the --nodb option (which causes dupd to skip creating the sqlite db and print to stdout instead, just like jdupes does). However, to me by far the most useful part of dupd is the interactive usage model it enables, which requires the sqlite database. So I won’t focus much on --nodb runs because I’d never run dupd that way and I want to focus on (my) real world usage.
    • File caches: This time I ran tests both with warmed up file caches (by repeating runs) and purged file caches (by explicitly clearing them prior to every run). For me, the warm file cache scenario is actually the one most closely matching real world usage because I tend to run a dupd scan, then work interactively on some subset of data, then run dupd scan again, repeat until tired. For someone whose workflow is to run a cold scan once and not re-scan until much later, the cold cache numbers will be more applicable.
  • I ran both dupd and jdupes with -q to eliminate informative output during the runs. It doesn’t make much difference in dupd but according to Jody this helps jdupes runtimes so I quieted both.
  • By default, dupd ignores hidden files and jdupes includes them. To make comparable runs, either use --hidden for dupd to include them or --nohidden for jdupes to exclude them. I decided to run with dupd --hidden (the resulting comparable command lines are sketched right after this list).
  • The average time reported below from each set of runs is the average of all runs but excluding the slowest and fastest runs.
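
Putting those notes together, the comparable command lines end up looking like this (they match the raw data at the bottom of this article; $HOME is simply the directory being scanned):

dupd scan -p $HOME -q --hidden                     # default dupd run (writes the sqlite db)
dupd scan -p $HOME -q --hidden --nodb > results    # stdout-only run, closest to jdupes behavior
jdupes -q -r $HOME > results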

Results (SSD)

These runs are on a slow machine (Intel Atom S1260 2.00GHz) and SSD drive (Intel SSD 520) running Linux (debian) ext4.

Files scanned: 197,171
Total duplicates: 77,044

For my most common usage scenario, jdupes takes almost three times (~2.8x) longer to process the same file set. Running dupd with --nodb is marginally faster than the default dupd run (but I wouldn’t really run it that way because the sqlite db is too convenient).

[chart: SSD, warm cache]

Next I tried clearing the file cache before every run. Here the dupd advantage is reduced, but jdupes still took about 1.4x longer than dupd.

[chart: SSD, cold cache]

Results (Slow Disk)

Jody’s tests show jdupes being faster in many scenarios, so I’d like to find a way to reproduce that. The answer is slow disks. I have this old Mac Mini (2009) with an even slower disk which should do the trick. Let’s see. Fewer files, fewer duplicates, but the disk is so slow these runs take a while (so I only ran 5 repetitions instead of 7).

Files scanned: 62,347
Total duplicates: 13,982

Indeed, here jdupes has the advantage as dupd takes about 1.7x longer.

[chart: slow disk, warm cache]

A few notes on these Mac runs:

  • I did also run dupd with --nodb, but here it didn’t make any meaningful difference.
  • I also ran both dupd and jdupes with cold file cache, or at least maybe I did. I ran the purge(8) command prior to each run. It claims to: “Purge can be used to approximate initial boot conditions with a cold disk buffer cache for performance analysis”. However, it made no difference at all in measured times for either dupd or jdupes.

Conclusions

It seems that in terms of performance, you’ll do better with dupd if you’re on SSDs but if you’re on HDDs then jdupes can be faster. Ideally, try both and let us know!

Also, even though tuning and testing the performance is so much fun, ultimately usability matters even more. For me, the interactive workflow supported by dupd is what makes it special (but then, that’s why I wrote it so I’m biased ;-) and I couldn’t live without it.

Finally, thanks to Jody for fixing a bug in dupd that showed up only on XFS (which I don’t use so never noticed) and for prompting me to do a few additional enhancements.


Raw Data (commands and times)

SSD: dupd, normal usage

% repeat 7 time dupd scan -p $HOME -q --hidden
dupd scan -p $HOME -q --hidden  4.89s user 8.00s system 138% cpu 9.273 total
dupd scan -p $HOME -q --hidden  5.04s user 8.23s system 142% cpu 9.335 total
dupd scan -p $HOME -q --hidden  4.98s user 7.78s system 139% cpu 9.141 total
dupd scan -p $HOME -q --hidden  4.86s user 7.92s system 139% cpu 9.146 total
dupd scan -p $HOME -q --hidden  5.61s user 8.00s system 143% cpu 9.503 total
dupd scan -p $HOME -q --hidden  4.95s user 7.79s system 140% cpu 9.082 total
dupd scan -p $HOME -q --hidden  4.96s user 7.80s system 139% cpu 9.119 total

average = 9.20

SSD: dupd, clear file cache

% repeat 7 (sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"; time dupd scan -p $HOME -q --hidden)
dupd scan -p $HOME -q --hidden  11.86s user 43.55s system 54% cpu 1:42.54 total
dupd scan -p $HOME -q --hidden  12.42s user 44.47s system 55% cpu 1:43.41 total
dupd scan -p $HOME -q --hidden  12.12s user 43.65s system 56% cpu 1:39.03 total
dupd scan -p $HOME -q --hidden  12.22s user 43.37s system 55% cpu 1:40.69 total
dupd scan -p $HOME -q --hidden  12.60s user 45.28s system 53% cpu 1:47.55 total
dupd scan -p $HOME -q --hidden  12.10s user 44.51s system 54% cpu 1:44.18 total
dupd scan -p $HOME -q --hidden  12.43s user 43.74s system 57% cpu 1:36.92 total

average = 101.97

SSD: dupd, do not create database

% repeat 7 time dupd scan -p $HOME -q --hidden  --nodb > results
dupd scan -p $HOME -q --hidden --nodb > results  4.26s user 7.70s system 136% cpu 8.785 total
dupd scan -p $HOME -q --hidden --nodb > results  4.36s user 7.54s system 136% cpu 8.710 total
dupd scan -p $HOME -q --hidden --nodb > results  4.28s user 7.69s system 136% cpu 8.770 total
dupd scan -p $HOME -q --hidden --nodb > results  4.23s user 7.64s system 136% cpu 8.708 total
dupd scan -p $HOME -q --hidden --nodb > results  4.34s user 7.58s system 136% cpu 8.757 total
dupd scan -p $HOME -q --hidden --nodb > results  4.19s user 7.66s system 135% cpu 8.736 total
dupd scan -p $HOME -q --hidden --nodb > results  4.58s user 7.75s system 140% cpu 8.772 total

average = 8.75

SSD: dupd, clear file cache and do not create database

repeat 7 (sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"; time dupd scan -p $HOME -q --hidden  --nodb > results)
dupd scan -p $HOME -q --hidden --nodb > results  9.67s user 36.51s system 51% cpu 1:29.39 total
dupd scan -p $HOME -q --hidden --nodb > results  10.92s user 43.76s system 53% cpu 1:41.45 total
dupd scan -p $HOME -q --hidden --nodb > results  10.79s user 43.58s system 54% cpu 1:38.93 total
dupd scan -p $HOME -q --hidden --nodb > results  10.62s user 43.59s system 56% cpu 1:35.45 total
dupd scan -p $HOME -q --hidden --nodb > results  10.76s user 44.39s system 54% cpu 1:41.45 total
dupd scan -p $HOME -q --hidden --nodb > results  10.92s user 43.78s system 55% cpu 1:38.87 total
dupd scan -p $HOME -q --hidden --nodb > results  10.72s user 43.07s system 53% cpu 1:41.50 total

average = 99.23

SSD: jdupes, warm file cache

repeat 7 time jdupes -q -r $HOME > results
jdupes -q -r $HOME > results  10.54s user 14.91s system 99% cpu 25.626 total
jdupes -q -r $HOME > results  10.76s user 14.66s system 99% cpu 25.587 total
jdupes -q -r $HOME > results  10.68s user 14.86s system 99% cpu 25.725 total
jdupes -q -r $HOME > results  10.76s user 14.67s system 99% cpu 25.614 total
jdupes -q -r $HOME > results  10.62s user 14.76s system 99% cpu 25.549 total
jdupes -q -r $HOME > results  10.75s user 14.87s system 99% cpu 25.801 total
jdupes -q -r $HOME > results  10.48s user 14.87s system 99% cpu 25.527 total

average = 25.62

SSD: jdupes, clear file cache

repeat 7 (sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"; time jdupes -q -r $HOME  > results)
jdupes -q -r $HOME > results  26.11s user 72.99s system 67% cpu 2:25.89 total
jdupes -q -r $HOME > results  26.54s user 71.06s system 68% cpu 2:22.01 total
jdupes -q -r $HOME > results  24.64s user 72.62s system 66% cpu 2:26.57 total
jdupes -q -r $HOME > results  26.01s user 70.05s system 68% cpu 2:20.15 total
jdupes -q -r $HOME > results  26.25s user 72.48s system 67% cpu 2:26.67 total
jdupes -q -r $HOME > results  24.63s user 70.70s system 67% cpu 2:20.77 total
jdupes -q -r $HOME > results  25.41s user 72.40s system 68% cpu 2:23.80 total

average = 143.81

Slow Disk: dupd, normal usage

% repeat 5 time dupd scan -p $HOME --hidden -q
dupd scan -p $HOME --hidden -q  4.62s user 29.72s system 2% cpu 21:05.70 total
dupd scan -p $HOME --hidden -q  4.38s user 29.88s system 2% cpu 22:34.14 total
dupd scan -p $HOME --hidden -q  4.78s user 30.09s system 2% cpu 21:29.52 total
dupd scan -p $HOME --hidden -q  4.37s user 29.07s system 2% cpu 21:18.10 total
dupd scan -p $HOME --hidden -q  4.39s user 29.24s system 2% cpu 21:11.19 total

average = 1279.60

Slow Disk: jdupes, warm file cache

% repeat 5 time jdupes -q -r $HOME > results
jdupes -q -r $HOME > results  8.88s user 31.70s system 5% cpu 13:05.37 total
jdupes -q -r $HOME > results  8.87s user 30.51s system 5% cpu 12:41.61 total
jdupes -q -r $HOME > results  8.80s user 30.56s system 4% cpu 13:30.56 total
jdupes -q -r $HOME > results  8.85s user 30.62s system 5% cpu 12:34.43 total
jdupes -q -r $HOME > results  8.80s user 30.18s system 5% cpu 12:32.14 total

average = 767.14