David Sun's Blog

This isn't the blog you're looking for.

Eliminating Duplicate Files in an Archive

Like many others, I was excited years ago when Google Plus became available by invite. I joined immediately in 2011 and used it until its shutdown in 2019. While similar in concept to Facebook, it didn’t have any of the extra nonsense that comes with Facebook. When Google Plus was shut down, I used Google Takeout to get an archive of all my data. It amounted to 47GB in about 72,000 files. Wow. That’s quite a bit of data! It has been happily sitting on my local hard drive, as well as in my local and cloud backups, since then.

It isn’t too surprising that the majority of this data consists of photos and videos. What is surprising is that there are duplicate copies of nearly everything in the archive! It turns out that there is one copy of each file in the Photos directory tree, and another copy in the Posts directory.
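A quick way to confirm this kind of duplication before changing anything is to hash every file and group the identical digests. The following is only a minimal sketch, assuming GNU findutils and coreutils (as found on Ubuntu under WSL); the function name list_duplicates is made up for illustration and is not part of any tool mentioned here.

```shell
#!/bin/bash
# Sketch: print groups of files under a directory that have identical content.
# Hypothetical helper; relies on GNU find, xargs, sha256sum, sort and uniq.
# Note: sha256sum escapes names containing backslashes or newlines, so such
# files may not be grouped correctly by this simple approach.
list_duplicates() {
  # 1. Hash every regular file; NUL-delimited names survive spaces in paths.
  # 2. Sort so that equal digests end up on adjacent lines.
  # 3. A SHA-256 digest is 64 hex characters, so compare only that prefix and
  #    print every line whose digest occurs more than once, one blank-line
  #    separated block per group.
  find "$1" -type f -print0 |
    xargs -0 -r sha256sum |
    sort |
    uniq -w64 --all-repeated=separate
}

# Usage: list_duplicates /path/to/Takeout
```

Each block of output lines shares one digest, so a photo that exists under both Photos and Posts shows up as a pair.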

Naturally, I wanted to find a way to eliminate these duplicates without breaking the directory structure. Normal file compression such as ZIP didn’t help: ZIP compresses each file on its own, so it never notices that two files in different directories are identical. It turns out that none of the commonly used archive file formats is designed for this either.

Many file systems have a way of handling this via hard links. Multiple file names can point to the exact same data, resulting in just one copy of the duplicate data. There is a utility called hardlink that will locate duplicate files and replace the duplicates with hard links! I was a bit surprised that it was already included in my WSL Ubuntu distribution and that I had never heard of this particular utility before! Sadly, it does not have a recursive flag and requires the user to call it with each directory as an argument. But that is easily rectified with a bit of shell scripting:

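#!/bin/bash
set -euo pipefail

# Root of the tree to deduplicate, e.g. the extracted Takeout directory.
SRC="${1:?usage: $0 DIRECTORY}"

# Collect every directory under SRC. NUL-delimited output keeps names with
# spaces, leading whitespace, or newlines intact.
DIRS=()
while IFS= read -r -d '' DIR; do
  DIRS+=("${DIR}")
done < <(find "${SRC}" -type d -print0)

# -t ignores timestamps when comparing files; -O keeps the oldest copy.
hardlink -tO "${DIRS[@]}"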

I’m sure there is a reliable one-liner that works in bash or other shells and handles spaces and other characters that might occur in directory names, but this was quick and easy.
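One candidate for such a one-liner is to let find hand the directories to hardlink directly. This is a sketch rather than a tested replacement: it assumes, as described above, that hardlink needs every directory passed explicitly, it reuses the same -tO options, and the wrapper name dedupe_tree is made up for illustration.

```shell
#!/bin/bash
# Sketch: find collects the directories and passes them to hardlink itself.
# "-exec ... {} +" appends as many directory paths as fit onto one command
# line, so the names never go through shell word splitting.
# Caveat: for a very large tree, find may split the list across several
# hardlink runs, and duplicates are then only matched within each run.
dedupe_tree() {
  find "$1" -type d -exec hardlink -tO {} +
}

# Usage: dedupe_tree /path/to/Takeout
```

The array-based script has the opposite trade-off: it always makes a single hardlink call, but that call fails outright if the directory list exceeds the system's argument length limit.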

So now my 47GB of data was reduced to a mere 22GB! Apparently, some files had multiple duplicates! It was still 72,000 files though. Ideally, I wanted it to be stored in some sort of archive file so that it could be moved and copied around without losing the data deduplication. Also, some cloud sync and backup applications perform poorly with many files compared to handling one larger archive. It turns out that the ISO file format (and thus CDs and DVDs) has a form of hard link support! genisoimage, aka mkisofs, did not come out of the box with my Ubuntu distribution, so I installed it (apt install genisoimage) and ran it to create an ISO file out of my deduplicated directory. And indeed the ISO file was 22GB and works correctly when mounted as a virtual drive under Windows! Success!
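The two steps above, checking the savings and building the ISO, can be sketched as follows. The post does not say which genisoimage options were actually used, so the flags here are assumptions: -r (Rock Ridge) and -J (Joliet, for readable names on Windows) are common choices, and -cache-inodes is the genisoimage option that detects hard links so shared data is written only once (it is already the default on Unix-like systems). The function names report_savings and make_iso are hypothetical, and GNU du is assumed.

```shell
#!/bin/bash
# Sketch: compare disk usage with and without hard links taken into account.
report_savings() {
  local dir="$1" unique counted
  unique=$(du -sk "$dir" | cut -f1)    # hard-linked data is counted once
  counted=$(du -skl "$dir" | cut -f1)  # -l counts every link as a full copy
  printf 'on disk: %s KiB, if every link were a full copy: %s KiB\n' \
    "$unique" "$counted"
}

# Sketch: build an ISO image from the deduplicated tree.
# Assumed flags, not necessarily the ones used in the post:
#   -r             Rock Ridge with sane default ownership and permissions
#   -J             Joliet names, so Windows shows the long file names
#   -cache-inodes  write hard-linked file contents only once
make_iso() {
  local src="$1" iso="$2"
  genisoimage -r -J -cache-inodes -o "$iso" "$src"
}

# Usage:
#   report_savings /path/to/Takeout
#   make_iso /path/to/Takeout takeout.iso
```

Depending on the data, further options may be needed, for example for very long file names or for individual files larger than 4 GiB; the genisoimage man page describes those cases.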

Interestingly, while on this journey I discovered that it is possible to store transparently compressed data in ISO files. The data is compressed but can be opened just like any plain old file without the use of a decompression utility, much like filesystems such as NTFS and ZFS that support storing compressed file data. This is accomplished by using the mkzftree utility, which comes with the genisoimage package, to create a directory structure containing compressed files. Then genisoimage is run on the compressed directory with the -z argument to create an ISO file. Unfortunately, the transparent part doesn’t work on Windows: when you open a file, you get the compressed data. According to the genisoimage man page, transparent decompression is only supported by Linux. Oh well. Luckily, JPEGs and videos are already compressed and typically cannot be losslessly compressed any further in a meaningful way, so it wasn’t a big loss for my use case.
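For reference, the two-step flow described above might look like the sketch below. The function name make_compressed_iso and its argument order are made up for illustration; the only tool options used are -z from the text plus -r, since the genisoimage man page states that -z requires Rock Ridge to be enabled.

```shell
#!/bin/bash
# Sketch: build an ISO with transparently compressed files (zisofs).
#   $1  source directory
#   $2  path for the compressed copy of the tree (a fresh path is assumed)
#   $3  ISO file to write
make_compressed_iso() {
  local src="$1" ztree="$2" iso="$3"
  # Step 1: mkzftree writes a compressed copy of the source tree to ztree.
  # Step 2: -z marks those files for transparent decompression; it needs
  #         Rock Ridge, which -r turns on.
  mkzftree "$src" "$ztree" &&
    genisoimage -z -r -o "$iso" "$ztree"
}

# Usage: make_compressed_iso /path/to/Takeout /tmp/takeout.z takeout-z.iso
```

The compressed tree is only an intermediate result and can be deleted once the ISO has been written.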

August 9, 2022

davidsun

Software

unix
