Tarsnap - Deduplication explanation

How does deduplication work?

Tarsnap deduplicates your data by:

Splitting it into chunks ("chunkification").
Only uploading chunks that haven't been uploaded before.

Each archive is expressed as a list of chunks, which grants us a great deal of flexibility and efficiency.

Simplified example: wordification

To illustrate how chunkification and chunk-based archives function, let's look at a simpler example: Suppose that we only want to store text files. Instead of splitting our data into a fixed number of characters, we'll split them into words ("wordification").

In particular, we will store a few of revisions of the file pets.txt, wherein we try to list our favourite type of pets.

First version: archive A

Since this is the Internet, we start off with pets.txt containing:

cute small kittens

How can we use "wordification" to help us archive this very important file?

Dataflow chart of a 'wordification' backup service, archive A

The first archive wasn't very exciting. The server knows that archive A consists of the words 1, 2, and 3 (those words being "cute", "small", and "kittens"). It would be easy to restore our valuable file if anything happened to it, but there wasn't any deduplicated data yet.

Second version: archive B

After a bit of thought, we change our pets.txt to:

fuzzy small bunnies

and create a second archive.

Dataflow chart of a 'wordification' backup service, archive B

This archive is more interesting. The computer knows that it already uploaded the word "small", so it can reuse that from the old word list. As a result, archive B is expressed as words 4, 2, and 5.

Later versions: archives C and D, and deleting A

We take three steps here:

Revise pets.txt to:
small fuzzy bunnies
and create archive C.
Delete archive A.
Revise pets.txt to:
small fuzzy kittens
and create archive D.

To focus on how the archives and server data changes, we'll look at all of the revisions in a table instead of diagrams for each archive.

Local computer	Transfer	Server
	(Add / Delete)	Words	Archives
New archive A: cute small kittens	+ words 1 2 3 + archive A	1: cute 2: small 3: kittens	A: 1 2 3
New archive B: fuzzy small bunnies	+ words 4 5 + archive B	1: cute 2: small 3: kittens 4: fuzzy 5: bunnies	A: 1 2 3 B: 4 2 5
New archive C: small fuzzy bunnies	+ archive C	1: cute 2: small 3: kittens 4: fuzzy 5: bunnies	A: 1 2 3 B: 4 2 5 C: 2 4 5
Delete archive A	- words 1 3 - archive A	2: small 4: fuzzy 5: bunnies	B: 4 2 5 C: 2 4 5
New archive D: small fuzzy kittens	+ word 6 + archive D	2: small 4: fuzzy 5: bunnies 6: kittens	B: 4 2 5 C: 2 4 5 D: 2 4 6

When we added archive C, we didn't need to upload any new words at all. Clear win for our wordification-based deduplication!

Deleting archive A prompted the server to remove any words that we no longer needed. That reduced our storage costs, however...

... when we added D, we incurred an "extra" transfer: Re-uploading the word "kittens" (now called word 6 instead of word 3). We could have avoided that transfer by creating archive D before deleting archive A.

Extending the "wordification" example

The simple "wordification" example only contained a single file, but it's a useful starting point to understand how Tarsnap archives work. Let's imagine how we could extend the example to cover a more realistic archive system:

Multiple files: Instead of an archive merely containing a list of words, the archive could contain a list of filenames, and each filename would then be associated with a list of words. For example:
E: pets.txt (2 4 6) uvic.txt (4 5) internet.txt (3 3 3)
Note that our list of words is shared and reused between all three files!
Files metadata: Information such as user/group ownership, permissions, and the modification date can also be associated with each filename.
Directories: Directories can be stored as a special case of files.
Local caching: The word list was sent from the server to the local computer every time a new archive was being created. That could make backups quite slow! In a real-world system, the list of words (or a hash of the chunked data) should be kept on the local system.
Encrypted data: The simplified example stores server data unencrypted — if a hacker broke in, they could find out that we considered bunnies as pets instead of kittens! To protect against that, the data should be encrypted with a private key on the local computer. For example, we should change:
2: fuzzy → 2: bzFxlArO5is=

For the curious, the above encrypted string was created with:
openssl enc -base64 -des-ecb -K 0123456789 -in text.txt -out text.base64
Tarsnap uses different cryptographic algorithms.
Encrypted metadata: However, encrypted data isn't enough. If an attacker saw that pets.txt had archive B: 4 2 5 and archive D: 2 4 6, they could still figure out that kittens were not our only choice for pets! To fix this, we need to encrypt the labels of words, as well as the archive lists. So we should replace:
2: bzFxlArO5is= → 3l2jG1Wwu1k=: bzFxlArO5is=
Naturally, the archive metadata should also be encrypted:
D: 2 4 6 → fnoMOFFK6Mk=: UB2vxoD8qbg=
Note that even the number of words in D is hidden —instead of storing the three word numbers as (2 4 6 → 3l2jG1Wwu1k= 22iBrXTZfNU= 6WsyTW1UI5k=), we encrypt the entire list to reduce the amount of information that an attacker could gather.
Non-text files: Most of our data is not in easily-handled text files, of course! How can we handle binary data?
First, let's remind ourselves of two benefits of "wordification":
- Variable-sized chunks: Words have different numbers of characters, so splitting data into fixed-size chunks (say, every 4 characters) would not be ideal.
- Context-dependent splitting: We don't try to guess where we should split our data before we've looked at the data. Words are split in spaces — in other words, the location of each split depends on the context (data).
Retaining those benefits when dealing with binary data requires some complicated processing. Tarsnap's method of doing this is outlined in section 2.1 of the presentation From bsdtar to tarsnap: Building an online backup service. The exact method is in the client source code, tar/multitape/chunkify.c.