More musings on S3 backups

So I've got automated backup scripts on the servers, and have set up an S3 bucket with a lifecycle policy to move things to cheaper storage over time.

Now I'm thinking about the best way to do the same for the various systems I've got at home. Backing up just my own home directory should cover 99% of any problems, as there aren't many services running whose config files would need backing up.

I've been giving some thought to the best way to upload everything to S3 encrypted, keep it up to date, and keep the storage costs low.

Using the s3cmd tool with 'sync' means it can't encrypt the files as part of the upload: the MD5 hashes S3 reports would be of the encrypted data and would never match the local (unencrypted) files, so every sync would re-upload everything. Instead I'll need to encrypt things locally and sync the encrypted copies, but that creates a problem of its own: if I tar and compress files together, a change to a single file means the whole archive has to be uploaded again.
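To make that concrete: s3cmd's sync decides whether to upload by comparing the local file's MD5 against the ETag S3 reports for the stored object. A minimal sketch of that check (the function names here are just for illustration, not anything from s3cmd itself):

```python
import hashlib

def local_md5(path, chunk_size=1024 * 1024):
    """MD5 of a local file, read in chunks to keep memory use low."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def needs_upload(path, s3_etag):
    # For a simple (non-multipart) upload the ETag S3 returns is just the MD5
    # of the stored object. If the object were encrypted on the fly during
    # upload, this comparison would never match the local plaintext, and sync
    # would re-upload every file every time.
    return local_md5(path) != s3_etag.strip('"')
```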

The solution I've decided on is to compress each directory into its own tar.gz file, encrypt it locally, then use 's3cmd sync'. If nothing in a directory has changed, the files will match, so for mostly static content (photos, music) the upload only has to happen once; content that does change (documents) compresses down well, so re-uploading it shouldn't be too much of an issue.
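A rough sketch of the shape of that pipeline, one archive per directory, encrypted into a staging area that then gets synced; the gpg key, paths and bucket name are placeholders, not the actual script:

```python
import subprocess
import tarfile
from pathlib import Path

GPG_KEY = "backup@example.org"            # placeholder recipient key
SOURCE = Path.home()                      # what gets backed up
STAGING = Path("/var/tmp/s3-staging")     # encrypted archives land here
BUCKET = "s3://my-backup-bucket/home/"    # placeholder bucket

def archive_and_encrypt(directory: Path) -> None:
    """tar.gz a single directory, then encrypt the archive with gpg."""
    STAGING.mkdir(parents=True, exist_ok=True)
    archive = STAGING / f"{directory.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(directory, arcname=directory.name)
    subprocess.run(
        ["gpg", "--batch", "--yes", "--recipient", GPG_KEY,
         "--output", f"{archive}.gpg", "--encrypt", str(archive)],
        check=True,
    )
    archive.unlink()  # only the encrypted copy stays in the staging area

for d in sorted(p for p in SOURCE.iterdir() if p.is_dir()):
    archive_and_encrypt(d)

# s3cmd sync uploads anything whose local copy differs from what's in the bucket.
subprocess.run(["s3cmd", "sync", f"{STAGING}/", BUCKET], check=True)
```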

I've enabled versioning on the bucket and configured it to keep the most recent version, transitioning it into cheaper storage over time, and to delete non-current versions after a short period of time.
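For reference, that sort of versioning and lifecycle setup can be expressed with boto3 along these lines; the bucket name, storage classes and day counts below are only examples, not the values I've actually used:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-backup-bucket"  # placeholder name

# Keep old versions around briefly so an accidental overwrite is recoverable.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Move current versions to cheaper storage over time, and expire old
# (non-current) versions after a short window.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "backup-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            }
        ],
    },
)
```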

I started writing a script in Bash to automate all of this, but it rapidly became unwieldy; at one point there were four pipes in a single command(!). It did work, but running it once to see how long a full pass would take came in at ~90 minutes, which is a ludicrous amount of time. After thinking about how to reduce that, sanity prevailed and I started from scratch in Python 🙂

It now does the same thing, but stores the MD5 hashes so it can compare the saved version of each directory with the 'live' one, and only re-creates the archives when they don't match. Another thing which helped massively was disabling compression in gpg, since the data is already gzipped. Creating all the initial files took about 60 minutes; re-scanning them all takes about 15, so it's quick enough to run frequently 🙂
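A rough sketch of that hash-cache idea, assuming the hashes are kept in a JSON file alongside the archives; the file names, layout and exact hashing scheme here are illustrative, not the actual script:

```python
import hashlib
import json
import subprocess
import tarfile
from pathlib import Path

CACHE = Path("/var/tmp/s3-staging/hashes.json")  # placeholder location

def directory_md5(directory: Path) -> str:
    """MD5 over every file's relative path and contents, so any change shows up."""
    md5 = hashlib.md5()
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            md5.update(str(path.relative_to(directory)).encode())
            md5.update(path.read_bytes())
    return md5.hexdigest()

def refresh_archive(directory: Path, staging: Path, gpg_key: str) -> None:
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    current = directory_md5(directory)
    if cache.get(str(directory)) == current:
        return  # nothing changed, keep the existing encrypted archive

    staging.mkdir(parents=True, exist_ok=True)
    archive = staging / f"{directory.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(directory, arcname=directory.name)
    # The data is already gzipped, so gpg compressing it again just burns CPU;
    # turning that off saves a lot of time.
    subprocess.run(
        ["gpg", "--batch", "--yes", "--compress-algo", "none",
         "--recipient", gpg_key, "--output", f"{archive}.gpg",
         "--encrypt", str(archive)],
        check=True,
    )
    archive.unlink()

    cache[str(directory)] = current
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    CACHE.write_text(json.dumps(cache, indent=2))
```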
