How efficient is deduplication?
Tarsnap automatically "deduplicates" — that is, identifies and removes duplicate blocks of data — from the archives it stores. While any one archive will usually not contain many copies of the same data (although there are exceptions, e.g., software developers with multiple check-outs of different source code trees), it is very common to have several archives created at different times which share a lot of their contents.
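To make "only stored once" concrete, here is a minimal sketch of block-level deduplication. It is a toy, not Tarsnap's implementation: the fixed 4-byte block size, the use of SHA-256 as the block identifier, and the `BlockStore` class are all assumptions chosen to keep the example short.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical and tiny, so the example is easy to follow


def split_blocks(data: bytes, size: int = BLOCK_SIZE):
    """Cut data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    """Toy store that keeps each distinct block exactly once."""

    def __init__(self):
        self.blocks = {}      # digest -> block bytes
        self.total_bytes = 0  # size of all archives before deduplication

    def add_archive(self, data: bytes):
        """Store an archive; return (list of block digests, new bytes stored)."""
        recipe, new_bytes = [], 0
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:  # only unseen blocks cost storage
                self.blocks[digest] = block
                new_bytes += len(block)
            recipe.append(digest)
        self.total_bytes += len(data)
        return recipe, new_bytes

    def unique_bytes(self) -> int:
        """Bytes actually stored after deduplication."""
        return sum(len(b) for b in self.blocks.values())

    def restore(self, recipe) -> bytes:
        """Rebuild an archive from its list of block digests."""
        return b"".join(self.blocks[d] for d in recipe)


store = BlockStore()
first, first_new = store.add_archive(b"AAAABBBBCCCC")
second, second_new = store.add_archive(b"AAAABBBBDDDD")  # shares two blocks
print(store.total_bytes, store.unique_bytes(), second_new)
```

The second archive shares two of its three blocks with the first, so it adds only one block of new data, yet both archives can still be restored in full from their digest lists.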
How efficient is this process? It depends on what kind of
data you are backing up, and how often. Let's look at a
few examples, produced via the
#1: Hourly Tarsnap server
Here's an example of hourly backup #24584 of the entire hard disk of one of the tarsnap.com servers:
                      Total size  Compressed size
All archives      96501744756700   22749145507765
  (unique data)      56058075815      15800338557
This archive          7588323817       1665760862
New data                 2241569           814611
This server has almost three years' worth of hourly snapshots, which add up to a total of about 96,502 GB. But after duplicate blocks of data are removed (or more precisely, only stored once), that drops to 56.1 GB, which is then compressed to 15.8 GB, an amount which costs $3.95/month to store.
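The arithmetic behind these figures can be checked with a short sketch. The byte counts come from the statistics above; the price of $0.25 per GB-month is an assumption inferred from the quoted $3.95 for 15.8 GB, and the `summarize` helper is a hypothetical name used only for illustration.

```python
BYTES_PER_GB = 10**9       # decimal gigabytes, matching the figures in the text
PRICE_PER_GB_MONTH = 0.25  # assumption: inferred from $3.95/month for 15.8 GB


def summarize(total: int, unique: int, compressed_unique: int) -> dict:
    """Derive the headline numbers from three byte counts.

    total             -- "All archives", total size
    unique            -- "(unique data)", total size
    compressed_unique -- "(unique data)", compressed size
    """
    stored_gb = compressed_unique / BYTES_PER_GB
    return {
        "total_gb": total / BYTES_PER_GB,
        "unique_gb": unique / BYTES_PER_GB,
        "stored_gb": stored_gb,
        "dedup_ratio": total / unique,                # saving from deduplication alone
        "overall_ratio": total / compressed_unique,   # deduplication plus compression
        "monthly_cost": stored_gb * PRICE_PER_GB_MONTH,
    }


server = summarize(96501744756700, 56058075815, 15800338557)
print(f"unique data:  {server['unique_gb']:.1f} GB")
print(f"stored data:  {server['stored_gb']:.1f} GB")
print(f"dedup ratio:  {server['dedup_ratio']:.0f}x")
print(f"monthly cost: ${server['monthly_cost']:.2f}")
```

Under that assumed price, the computed monthly cost rounds to the $3.95 quoted above, and the deduplication ratio shows that each block of this server's data appears, on average, in well over a thousand snapshots.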
#2: Infrequent personal usage
By contrast, here's backup #32 of personal email and documents, from a backup which is run "almost every month":
                      Total size  Compressed size
All archives         14304559063      10709002903
  (unique data)        961452556        673438434
This archive           526692949        388934283
New data                 2308253           359832
This computer backs up only about half a gigabyte at a time; storing every snapshot in full would take 14,305 MB (email accumulates over time). Thanks to deduplication, all of the snapshots can be reproduced from only 961 MB of unique data, which compresses to 673 MB and costs $0.17/month to store.
Summary of the examples:
Example                   #1          #2
Snapshots              24584          32
Frequency             hourly     monthly
All data           96,502 GB     14.3 GB
Unique data          56.1 GB     0.96 GB
Encoded data         15.8 GB     0.67 GB
Storage cost           $3.95       $0.17  (monthly)