Tarsnap - Online backups for the truly paranoid

How efficient is deduplication?

Tarsnap automatically "deduplicates" — that is, identifies and removes duplicate blocks of data — from the archives it stores. While any one archive will usually not contain many copies of the same data (although there are exceptions, e.g., software developers with multiple check-outs of different source code trees), it is very common to have several archives created at different times which share a lot of their contents.

How efficient is this process? It depends on what kind of data you are backing up, and how often. Let's look at a few examples, produced via the --print-stats option.

#1: Hourly Tarsnap server

Here's an example of hourly backup #24584 of the entire hard disk of one of the tarsnap.com servers:

                     Total size  Compressed size
All archives     96491744756700   22749145507765
  (unique data)     56058075815      15800338557
This archive         7588323817       1665760862
New data                2241569           814611

This server has almost three years' worth of hourly snapshots, which add up to a total of 96,492 GB. But after duplicate blocks of data are removed (or more precisely, only stored once), that drops to 56.1 GB, which is then compressed to 15.8 GB — an amount which costs $3.95/month to store.
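The cost figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the rate of $0.25 per GB-month is an assumption inferred from the numbers quoted here ($3.95 for 15.8 GB), decimal gigabytes are assumed, and the function names are hypothetical.

```python
# Assumption: storage rate implied by the figures quoted above
# ($3.95/month for 15.8 GB); not taken from a price list.
ASSUMED_USD_PER_GB_MONTH = 0.25
BYTES_PER_GB = 10**9  # decimal gigabytes, matching the sizes quoted in the text


def monthly_storage_cost(compressed_unique_bytes, rate=ASSUMED_USD_PER_GB_MONTH):
    """Estimated monthly cost in USD of storing the compressed unique data."""
    return compressed_unique_bytes / BYTES_PER_GB * rate


def new_data_fraction(new_bytes, archive_bytes):
    """Fraction of an archive that was not already stored by earlier archives."""
    return new_bytes / archive_bytes


# Figures from example #1: compressed unique data, then uncompressed
# "New data" and "This archive" sizes.
cost = monthly_storage_cost(15800338557)
frac = new_data_fraction(2241569, 7588323817)

print(f"Estimated storage cost: ${cost:.2f}/month")
print(f"New data in this archive: {frac:.4%}")
```

Only the compressed size of the unique data enters the cost estimate. The second figure shows how little of this hourly archive was new: roughly 2.2 MB out of 7.6 GB.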

#2: Infrequent personal usage

By contrast, here's backup #32 of personal email and documents, which is run "almost every month":

                     Total size  Compressed size
All archives        14304559063      10709002903
  (unique data)       961452556        673438434
This archive          526692949        388934283
New data                2308253           359832

This computer is only backing up half a gigabyte; the sum of all the snapshots would give 14,305 MB (email accumulates over time). But thanks to deduplication, all of these snapshots can be stored in only 961 MB, which compresses further to 673 MB and costs $0.17/month to store.
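The ratios behind these numbers can be derived directly from the statistics table. The following is a minimal sketch that assumes the output has exactly the layout shown on this page (a label followed by two integer columns); `parse_stats` and the `ROW` pattern are hypothetical helpers, not part of Tarsnap.

```python
import re

# Statistics for example #2, copied from the table above.
STATS = """\
                     Total size  Compressed size
All archives        14304559063  10709002903
  (unique data)       961452556  673438434
This archive          526692949  388934283
New data                2308253  359832
"""

# A label, then two whitespace-separated integers (total, compressed).
# The header line has no integers, so it does not match.
ROW = re.compile(r"^\s*(.+?)\s+(\d+)\s+(\d+)\s*$")


def parse_stats(text):
    """Return {label: (total_bytes, compressed_bytes)} for each data row."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return rows


rows = parse_stats(STATS)
all_total, _ = rows["All archives"]
unique_total, unique_compressed = rows["(unique data)"]

# How many times larger the archives would be without deduplication.
dedup_ratio = all_total / unique_total
# Fraction of the unique data that remains after compression.
compression_ratio = unique_compressed / unique_total

print(f"Deduplication ratio: {dedup_ratio:.1f}x")
print(f"Compressed size / unique size: {compression_ratio:.2f}")
```

For this example, deduplication shrinks the stored data by a factor of roughly 15, and compression then removes about 30% of what remains.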


Summary of the examples:

Example               #1        #2
Snapshots          24584        32
Frequency         hourly   monthly
All data       96,492 GB   14.3 GB
Unique data      56.1 GB   0.96 GB
Compressed data  15.8 GB   0.67 GB
Cost per month     $3.95     $0.17