GNU/Linux Crypto: Backups

This entry is part 8 of 10 in the series GNU/Linux Crypto.

While local backups on a USB disk or spare hard drive are important for quick restores, it’s equally important to keep a backup offsite, from which you can restore your important documents if, for example, your office were burgled or burned down, destroying both your workstation and your backup media.

For most people, the easiest way to do this is with a storage provider: convenient access to bulk storage of a suitable size, maintained on another company’s systems, for a relatively modest price or even for free. Examples include the Ubuntu One service and Microsoft’s offering, SkyDrive. The best storage providers will also encrypt the data on their own servers, whether or not they themselves retain access to it.

Trusting a company with all your data and the encryption thereof is risky, particularly given recent revelations of corporate collusion with the NSA, and privacy-conscious users should prefer the security of encrypting the backups before they go up onto the provider’s servers. The provider may implement closed and/or symmetric encryption mechanisms of their own, which may or may not be trustworthy. For very strong personal encryption, as established, we can use our GnuPG setup to encrypt files before we put them up there:

$ tar -cf docsbackup-"$(date +%Y-%m-%d)".tar $HOME/Documents
$ gpg --encrypt --recipient tom@sanctum.geek.nz docsbackup-2013-07-27.tar
$ scp docsbackup-2013-07-27.tar.gpg user@backup.example.com:docsbackup
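
Getting the documents back is just the reverse: copy the encrypted archive down, decrypt it with your private key (the step that does require your passphrase), and unpack it. A rough sketch, assuming docsbackup is a directory on the server and using the filenames from the example above:

$ scp user@backup.example.com:docsbackup/docsbackup-2013-07-27.tar.gpg .
$ gpg --output docsbackup-2013-07-27.tar --decrypt docsbackup-2013-07-27.tar.gpg
$ tar -xf docsbackup-2013-07-27.tar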

The problem with encrypting whole files like this is that, even for modestly sized data, making a complete backup and uploading every file again each time costs a lot of bandwidth. Similarly, we’d like to be able to restore our personal files as they were on a specific date, in case of bad backups or accidental deletion, but without storing a full copy of every file for every backup day, which would take far too much space.

Incremental backups

The usual solution is an incremental backup system: after your files are first uploaded in their entirety, successive backups upload only the changes, stored in a retrievable and space-efficient format. Systems like Dirvish, a free Perl frontend to rsync(1), allow this.

Unfortunately, Dirvish doesn’t encrypt the files or changesets it stores. What’s needed is an incremental backup solution that efficiently calculates and stores changes in files on a remote server, and also encrypts them. Duplicity, a Python tool built around librsync, excels at this, and can use our GnuPG asymmetric key setup for the file encryption. It’s available in Debian-derived systems in the duplicity package. Note that, as before, a GnuPG key setup with an agent is required for this to work.
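
On a Debian or Ubuntu system, for example, installing it is a single command (package names may differ on other distributions):

$ sudo apt-get install duplicity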

Usage

We can get an idea of how duplicity(1) works by asking it to start a backup vault on our local machine. It takes source and destination arguments in much the same way as tools like rsync or scp:

$ cd
$ duplicity --encrypt-key tom@sanctum.geek.nz Documents file://docsbackup

It’s important to specify --encrypt-key, because otherwise duplicity(1) will use symmetric encryption with a passphrase rather than a public key, which is considerably less secure. Specify the email address corresponding to the public keypair you would like to use for the encryption.
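
If you’re not sure which user ID to give, you can check which public keys are in your keyring first; for example:

$ gpg --list-keys tom@sanctum.geek.nz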

The duplicity command above performs a full encrypted backup of the directory, returning the following output:

Local and Remote metadata are synchronized, no sync needed.
Last full backup date: none
No signatures found, switching to full backup.
--------------[ Backup Statistics ]--------------
StartTime 1374903081.74 (Sat Jul 27 17:31:21 2013)
EndTime 1374903081.75 (Sat Jul 27 17:31:21 2013)
ElapsedTime 0.01 (0.01 seconds)
SourceFiles 4
SourceFileSize 142251 (139 KB)
NewFiles 4
NewFileSize 142251 (139 KB)
DeletedFiles 0
ChangedFiles 0
ChangedFileSize 0 (0 bytes)
ChangedDeltaSize 0 (0 bytes)
DeltaEntries 4
RawDeltaSize 138155 (135 KB)
TotalDestinationSizeChange 138461 (135 KB)
Errors 0
-------------------------------------------------

You’ll note you were not prompted for your passphrase to do this. Remember, encrypting files with your public key does not require a passphrase; the whole idea is that anyone can encrypt using your key without needing your permission.

Checking the created directory docsbackup, we find three new files within it, all three of them encrypted:

$ ls -1 docsbackup
duplicity-full.20130727T053121Z.manifest.gpg
duplicity-full.20130727T053121Z.vol1.difftar.gpg
duplicity-full-signatures.20130727T053121Z.sigtar.gpg

The vol1.difftar.gpg file contains the actual data stored; the other two contain metadata about the backup’s contents, used to calculate differences the next time the backup runs.
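
If you ever want a summary of what a vault contains, the collection-status action lists the chain of full and incremental backups stored in it; for our local vault, that would be something like:

$ duplicity collection-status file://docsbackup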

If we add a small new file to the directory being backed up and run the same backup command again, we can see that the backup is performed incrementally, and only the changes (the new file) have been saved:

$ duplicity --encrypt-key tom@sanctum.geek.nz Documents file://docsbackup
Local and Remote metadata are synchronized, no sync needed.
Last full backup date: Sat Jul 27 17:34:33 2013
--------------[ Backup Statistics ]--------------
StartTime 1374903396.52 (Sat Jul 27 17:36:36 2013)
EndTime 1374903396.52 (Sat Jul 27 17:36:36 2013)
ElapsedTime 0.01 (0.01 seconds)
SourceFiles 5
SourceFileSize 142255 (139 KB)
NewFiles 2
NewFileSize 4100 (4.00 KB)
DeletedFiles 0
ChangedFiles 0
ChangedFileSize 0 (0 bytes)
ChangedDeltaSize 0 (0 bytes)
DeltaEntries 2
RawDeltaSize 4 (4 bytes)
TotalDestinationSizeChange 753 (753 bytes)
Errors 0
-------------------------------------------------

We also find three new files in the docsbackup directory containing the new data:

$ ls -1 docsbackup
duplicity-full.20130727T053433Z.manifest.gpg
duplicity-full.20130727T053433Z.vol1.difftar.gpg
duplicity-full-signatures.20130727T053433Z.sigtar.gpg
duplicity-inc.20130727T053433Z.to.20130727T053636Z.manifest.gpg
duplicity-inc.20130727T053433Z.to.20130727T053636Z.vol1.difftar.gpg
duplicity-new-signatures.20130727T053433Z.to.20130727T053636Z.sigtar.gpg

Note that the new files have names beginning with duplicity-inc or duplicity-new, marking them as part of an incremental backup rather than a full one.

Note that in order to keep track of what has already been backed up, duplicity(1) caches metadata in ~/.cache/duplicity, as well as storing a copy along with the backup. This allows our backup processes to run unattended, rather than having to enter a passphrase to decrypt the metadata on the remote server before each incremental backup. If we lose the cached copies, that’s fine; duplicity can retrieve them from the backup vault, prompting us for our passphrase to decrypt them.
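
A quick way to check what the vault currently holds is the list-current-files action, which works from this metadata rather than from the backed-up data itself:

$ duplicity list-current-files file://docsbackup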

Remote backups

If you have SSH or even just SCP/SFTP access to your storage provider’s servers, not much has to change to make duplicity(1) store the files up there instead:

$ duplicity --encrypt-key tom@sanctum.geek.nz Documents sftp://user@backup.example.com/docsbackup

Your backups will then be sent over an SSH link to the directory docsbackup on the system backup.example.com, as the user user. In this way, not only is all the data protected in transit, it’s also stored encrypted on the remote server, which never sees your plaintext data. All anyone with access to your backups can see is their approximate size, the dates they were made, and (if you publish your public key) the user ID on the GnuPG key used to encrypt them.

If you’re using the ssh-agent(1) program to store your decrypted private keys, you won’t even have to enter a passphrase for that.
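
If the agent isn’t already running in your session, starting it and loading your key looks something like this (the key path here is only an example):

$ eval "$(ssh-agent -s)"
$ ssh-add ~/.ssh/id_rsa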

The duplicity(1) frontend supports other methods of uploading to different servers, too, including the boto backend for Amazon Web Services S3, the gdocs backend for Google Docs, and httplib2 or oauthlib for Ubuntu One.
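
For example, with the boto backend installed and your AWS credentials in the environment, an Amazon S3 vault can be named with an s3+http:// URL in place of the sftp:// one; the bucket name below is only a placeholder:

$ export AWS_ACCESS_KEY_ID=...      # your AWS credentials
$ export AWS_SECRET_ACCESS_KEY=...
$ duplicity --encrypt-key tom@sanctum.geek.nz Documents s3+http://my-backup-bucket/docsbackup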

If you like, you can also sign your backups, so that any tampering with them can be detected when you restore, by changing --encrypt-key to --encrypt-sign-key. Note that signing does require your passphrase.
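
Using the same key for both encryption and signing, that looks like this:

$ duplicity --encrypt-sign-key tom@sanctum.geek.nz Documents sftp://user@backup.example.com/docsbackup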

Restoring

Restoring from a duplicity(1) backup volume is much the same, but with the arguments reversed:

$ duplicity sftp://user@backup.example.com/docsbackup docsrestore
Synchronizing remote metadata to local cache...
GnuPG passphrase:
Copying duplicity-full-signatures.20130727T053433Z.sigtar.gpg to local cache.
Copying duplicity-full.20130727T053433Z.manifest.gpg to local cache.
Copying duplicity-inc.20130727T053433Z.to.20130727T053636Z.manifest.gpg to local cache.
Copying duplicity-new-signatures.20130727T053433Z.to.20130727T053636Z.sigtar.gpg to local cache.
Last full backup date: Sat Jul 27 17:34:33 2013

Note that this time you are asked for your passphrase. This is because restoring the backup requires decrypting the data and possibly the signatures in the backup vault. After doing this, the complete set of documents from the time of your most recent incremental backup will be available in docsrestore.

Using this incremental system also allows you to restore your data as it was at the last backup before a given time. For example, to retrieve my ~/Documents directory as it was three days ago, I might run this:

$ duplicity --time 3D \
    sftp://user@backup.example.com/docsbackup \
    docsrestore

For large vaults, you can extend this to restore only a particular file:

$ duplicity --time 3D \
    --file-to-restore private/eff.txt \
    sftp://user@backup.example.com/docsbackup \
    docsrestore

Automating

You should run your first full backup interactively to make sure it’s doing exactly what you need, but once you’re confident that everything is working correctly, you can set up a simple Bash script to run incremental backups for you. Here’s an example script, saved in $HOME/.local/bin/backup-remote:

#!/usr/bin/env bash

# Run keychain to recognise any agents holding decrypted keys we might need
# (optional, depending on your SSH key setup)
eval "$(keychain --eval --quiet)"

# Specify the GnuPG key ID, the directory to back up, and the remote
# backup URL
keyid=tom@sanctum.geek.nz
local=/home/tom/Documents
remote=sftp://user@backup.example.com/docbackups

# Run backup with duplicity
/usr/bin/duplicity --encrypt-key "$keyid" -- "$local" "$remote"

The line with keychain is optional, but will be necessary if you’re using an SSH key with a passphrase on it; you’ll also need to have authenticated with ssh-agent at least once. See the earlier article on SSH/GPG agents for details on this setup.

Don’t forget to make the script executable:

$ chmod +x ~/.local/bin/backup-remote

You can then have cron(8) call this for you every day, running it as your user, by editing your user crontab(5) file:

$ crontab -e

The following line would run this script every morning at 6.00am:

0 6 * * *   ~/.local/bin/backup-remote

Tips

A few general best practices apply to this, consistent with the Tao of Backup:

  • Check that your backups completed; either have the output of the cron script mailed to you, or log it to a file that you check at least occasionally to make sure your backups are working. I highly recommend using an email message, and including error output:

    MAILTO=you@example.com
    0 6 * * *   ~/.local/bin/backup-remote 2>&1
    
  • Keep backups on your own local disks or servers too; encrypting your remote backups might stop your backup provider from reading your files, but it won’t save them if they’re accidentally deleted.

  • Don’t forget to occasionally test-restore your backups to make sure they’re working correctly. It’s also wise to use duplicity verify on them occasionally, particularly if you don’t back up every day:

    $ duplicity verify sftp://user@remote.example.com/docbackups Documents
    Local and Remote metadata are synchronized, no sync needed.
    Last full backup date: Sat Jul 27 17:34:33 2013
    GnuPG passphrase:
    Verify complete: 2195 files compared, 0 differences found.
    
  • This incremental system means that you’ll likely only have to make full backups once, so you should back up too much data rather than too little; if you can spare the bandwidth and have the space, backing up your entire computer isn’t really that extreme.

  • Try not to depend too much on your remote backups; see them as a last resort, and work securely and with local backups as much as you can. Certainly, never rely on backups as a version control system; use Git for that.