Tarsnap - Efficiency

Efficiency

When creating archives, Tarsnap takes streams of archive data and splits them into variable-length blocks; these blocks are compared, and any duplicate blocks are removed (i.e., the data is "de-duplicated") before being uploaded to the Tarsnap server.
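As a rough illustration of variable-length splitting and de-duplication, here is a minimal Python sketch. It is not Tarsnap's actual algorithm: the window size, boundary mask, block-size limits, and the names `split_blocks` and `deduplicate` are all assumptions chosen for this example. Block boundaries are chosen from the content itself (a hash of the last few bytes), so inserting data near the start of a stream shifts later bytes without changing where most later blocks begin and end.

```python
import hashlib
import random

# All constants below are illustrative assumptions, not Tarsnap's parameters.
WINDOW = 16          # bytes of trailing context that decide a boundary
MASK = (1 << 8) - 1  # boundary when the low 8 bits of the window hash are zero
MIN_BLOCK = 64       # never cut a block shorter than this (except the last)
MAX_BLOCK = 1024     # always cut once a block reaches this length


def split_blocks(data: bytes) -> list:
    """Split data into variable-length blocks at content-defined boundaries."""
    blocks = []
    start = 0
    for i in range(len(data)):
        length = i - start + 1
        if length < MIN_BLOCK:
            continue
        # Hash only the last WINDOW bytes, so the decision depends on local
        # content rather than on the absolute position in the stream.
        window = data[i - WINDOW + 1 : i + 1]
        h = int.from_bytes(hashlib.blake2b(window, digest_size=4).digest(), "big")
        if (h & MASK) == 0 or length >= MAX_BLOCK:
            blocks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])  # trailing partial block
    return blocks


def deduplicate(blocks, stored: dict) -> list:
    """Return only the blocks not yet in `stored`, recording them as stored."""
    new_blocks = []
    for block in blocks:
        key = hashlib.sha256(block).digest()
        if key not in stored:
            stored[key] = block
            new_blocks.append(block)
    return new_blocks


# Demo: archive some data, then archive it again with ten bytes inserted
# at the front. Most blocks of the second stream are expected to be duplicates.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = b"inserted!!" + original

stored = {}
first = deduplicate(split_blocks(original), stored)
second = deduplicate(split_blocks(edited), stored)
print(len(first), "blocks uploaded for the original stream")
print(len(second), "blocks uploaded after the insertion")
```

With fixed-size blocks, the same ten-byte insertion would shift every block boundary and every block would look new; content-defined boundaries are what let the later blocks match again.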

Tarsnap keeps a local cache recording which blocks have previously been stored, and consults it when creating further archives. As a result, storing two archives containing the same data takes only slightly more space than storing a single archive (the difference is a small amount of per-archive overhead). More importantly, if files change between archives, only the modified data needs to be uploaded when the second archive is created.
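The role of the local cache can be sketched as follows. This is a simplified model, not Tarsnap's implementation: the `Server` class, the `create_archive` function, and the use of SHA-256 hex digests as block identifiers are assumptions made for the example, and blocks are passed in already split.

```python
import hashlib


class Server:
    """Hypothetical stand-in for remote storage."""

    def __init__(self):
        self.blocks = {}    # block hash -> block data (stored once)
        self.archives = {}  # archive name -> list of block hashes


def create_archive(name, blocks, cache, server):
    """Store an archive, uploading only blocks missing from the local cache.

    Returns the number of bytes of block data uploaded.
    """
    uploaded = 0
    index = []
    for block in blocks:
        key = hashlib.sha256(block).hexdigest()
        index.append(key)
        if key not in cache:
            server.blocks[key] = block
            cache.add(key)
            uploaded += len(block)
    # The per-archive overhead in this model is the list of block hashes.
    server.archives[name] = index
    return uploaded


server = Server()
cache = set()  # local record of block hashes already stored

monday = [b"alpha" * 100, b"beta" * 100, b"gamma" * 100]
tuesday = list(monday)                                       # nothing changed
wednesday = [b"alpha" * 100, b"delta" * 100, b"gamma" * 100]  # one block changed

print(create_archive("monday", monday, cache, server))        # 1400
print(create_archive("tuesday", tuesday, cache, server))      # 0
print(create_archive("wednesday", wednesday, cache, server))  # 500
```

Because the cache is consulted locally, the client in this model can decide which blocks to upload without asking the server what it already holds.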

Due to Tarsnap's "de-duplication" functionality, it uses the same or less storage than a traditional full-plus-incrementals backup system, while still providing the flexibility of allowing archives to be created and deleted independently of each other. In the case of log files, mail spools, and other large files which frequently have small amounts of data appended to them, Tarsnap uses far less bandwidth and storage than incremental backups, since it avoids storing multiple copies of the unchanging segments of files. We have collected a few examples of de-duplication efficiency.
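To see why append-only files benefit so much, consider a back-of-the-envelope comparison. The figures below are hypothetical, not measurements, and the model rests on stated assumptions: the incremental scheme stores a whole new copy of any file that has changed since the previous backup, while the de-duplicating scheme stores only the appended data. Compression, block-boundary effects, and per-archive overhead are ignored, and the function names are invented for this sketch.

```python
def whole_file_total(initial_mb, daily_growth_mb, backups):
    """Storage if every backup keeps a full copy of the changed file."""
    return sum(initial_mb + daily_growth_mb * day for day in range(backups))


def dedup_total(initial_mb, daily_growth_mb, backups):
    """Storage if only data not seen before is kept."""
    return initial_mb + daily_growth_mb * (backups - 1)


# Hypothetical log file: 1000 MB to start, 10 MB appended per day,
# backed up once a day for 30 days.
print(whole_file_total(1000, 10, 30))  # 34350
print(dedup_total(1000, 10, 30))       # 1290
```

Under these assumptions the de-duplicated total is simply the final size of the file, whereas the file-level scheme pays for the unchanged beginning of the file again in every backup.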

Tarsnap also matches the run-time performance of incremental backups by keeping a cache containing the paths, inode numbers, sizes, and modification times of files. If a file has not been modified, Tarsnap (like incremental backup systems) avoids spending time reading it from disk.
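A check of this kind can be sketched with Python's `os.stat`. This is only an illustration of the idea, not Tarsnap's cache format: the function names and the in-memory dictionary are assumptions, and a real cache would be persisted between runs.

```python
import os
import tempfile


def file_signature(path):
    """Return (inode number, size, modification time) for a file."""
    st = os.stat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns)


def files_to_read(paths, cache):
    """Return the paths whose metadata differs from the cache, updating it.

    Files whose inode, size, and mtime all match the cached values are
    assumed unchanged and are skipped without being opened.
    """
    changed = []
    for path in paths:
        signature = file_signature(path)
        if cache.get(path) != signature:
            changed.append(path)
            cache[path] = signature
    return changed


with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "app.log")
    conf = os.path.join(tmp, "app.conf")
    for path in (log, conf):
        with open(path, "w") as f:
            f.write("initial contents\n")

    cache = {}
    print(len(files_to_read([log, conf], cache)))  # 2: nothing cached yet
    print(len(files_to_read([log, conf], cache)))  # 0: metadata unchanged

    with open(log, "a") as f:
        f.write("one more line\n")                 # size changes
    print(files_to_read([log, conf], cache) == [log])  # True
```

The trade-off of relying on metadata is that a file rewritten in place with the same size and modification time would be missed; that is the same assumption incremental backup tools make when they skip files by timestamp.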