Looking for a better backup solution

Posted on December 9, 2018.

DIY is fun, but time-consuming…

Backups!

After my last post, I didn’t feel like writing for a while. But now I’ve found a good subject: backups. Ah, backups…

I’ve run my current home-grown backup solution for a long time. Git history says at least since mid-2007 (so 11½ years), but the scripts didn’t start in Git, so 12 years is a fair assessment.

It’s a simple solution, based on incremental filesystem dumps, going back to level 0 periodically. I did use my backups to recover files (around once per year, I think), so it works, but it’s clunky. The biggest deficiencies are:

  • I don’t have enough space to back up everything I want to back up, if I want long-term history (since the full dumps every N units of time are costly).
  • Since the dump utility I use is limited to 9 levels, it also limits how often I can take backups, which leads to too-coarse backup granularity (and large at-risk intervals).
  • Since the dumps are incremental, one needs to restore the right archives in the right order just to get to a single file (see the sketch below)… urgh!
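To make the pain concrete, a level-based dump/restore cycle looks roughly like this (paths and levels are made up for illustration, not my actual setup); note that recovering a single file means replaying every level since the last full dump:

    # Full (level 0) dump, recorded in /etc/dumpdates via -u
    dump -0u -f /backup/home.0.dump /home
    # Later runs: incremental dumps at increasing levels
    dump -1u -f /backup/home.1.dump /home
    dump -2u -f /backup/home.2.dump /home

    # Recovery: restore the levels in order into an empty directory...
    cd /mnt/restore
    restore -rf /backup/home.0.dump
    restore -rf /backup/home.1.dump
    restore -rf /backup/home.2.dump
    # ...and only then pick out the file you actually wanted.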

Clearly I’m using technology from the stone-age, so this week I took a look at what’s available to replace my home-grown stuff.

But let’s make it clear first: I’m not interested in cloud-based backups as the main solution. They might suit as an N+M (where M > 2) option, but not as the primary/only solution. Plus, where’s the fun in delegating the technical stuff to others?

Various options

rsnapshot

The first thing I looked at, because it was on the back of my mind for a while, was rsnapshot. Its simplicity is very appealing, as well as its nice file-based deduplication, but a quick look at the current situation is not very encouraging:

  • it seems half-orphaned; not a dire situation in itself, but despite much discussion on that bug, it never got clear closure; activity is low: the last official release was in 2015, with only a few commits since then;
  • low activity wouldn’t be a problem, but there are quite a few bugs filed that point to potential data loss, for example issue 141: “At certain conditions rsnapshot removes old backups without make new ones”;

Looking especially at the mentioned issue 141 made me realise that the use of relative (e.g. hourly.N, etc.) timestamps is what leads to fragility in the script. Ideally the actual directories would be absolute-timestamp-based (e.g. 2018-12-09T15:45:44), and there would just be helpful symlinks (hourly.0) to these. Sure, there is the “sync_first” mode, which seems safer, but it still doesn’t guarantee the correct transition, since the various rotate calls are independent from each other and from the sync action itself.
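To illustrate what I mean, here is a minimal sketch (not rsnapshot’s actual behaviour; the paths and the rsync invocation are made up): each snapshot goes into its own absolute-timestamp directory, and the friendly name is only flipped once the snapshot is complete:

    #!/bin/sh
    set -e
    SNAPROOT=/snapshots                        # hypothetical backup destination
    STAMP=$(date +%Y-%m-%dT%H:%M:%S)
    # New snapshot in its own absolute-timestamp directory, hard-linking
    # unchanged files against the previous snapshot (file-based dedup).
    rsync -a --link-dest="$SNAPROOT/latest/" /home/ "$SNAPROOT/$STAMP/"
    # Only after the copy succeeded, atomically repoint the convenience symlink.
    ln -sfn "$STAMP" "$SNAPROOT/latest.new"
    mv -T "$SNAPROOT/latest.new" "$SNAPROOT/latest"

Pruning then becomes deleting old timestamped directories, with no rotation step that can go wrong halfway.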

Speaking of the rotate calls, the whole cron story (“create a cron entry for each cycle, and make sure to run the greater period ones before the lower periods”) points to more issues regarding the architecture of the rotation.
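For reference, the kind of crontab the documentation suggests looks something like this (times are only an example); the daily rotation has to run before the hourly one, since daily.0 is created from the oldest hourly, and nothing enforces that ordering except the schedule you write:

    # /etc/cron.d/rsnapshot (example)
    30 3  * * *   root  /usr/bin/rsnapshot daily
    0  */4 * * *  root  /usr/bin/rsnapshot hourly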

The conclusion was that, at best, this would be a small improvement on my current solution. And since rsnapshot itself is a 4K LOC Perl script, I’m unlikely to contribute significantly to it; also, the changes I’d want would significantly alter how it is used.

So, if this doesn’t work, what about other solutions?

borg backup

A tool very highly spoken of in the DIY/self-hosting backup world is borgbackup. A quick look at it shows many advantages over rsnapshot:

  • space-efficient storage, due to chunk-based deduplication (variable-length chunks? it’s not entirely clear what the criteria for chunk length are), even across source filesystems, source machines, etc.
  • data encryption, yay!
  • customisable compression

It can also do off-site backups, of course, requiring SSH access; and if the tool is also installed on the remote side, it’s much more efficient.
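For the curious, the basic flow is something like the following (repository location, compression choice and paths are made up, not a recommendation):

    # One-time repository setup, with encryption
    borg init --encryption=repokey ssh://backup@nas/srv/borg/repo
    # Periodic backup: one archive per run, named after host and timestamp
    borg create --compression zstd --stats \
        ssh://backup@nas/srv/borg/repo::'{hostname}-{now}' \
        /home /etc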

Something not clearly highlighted in the readme is the “correct” (IMHO) handling of repository maintenance: since archives are time-based and not relative, you declare pruning much more logically, along the lines of “keep only N backups older than T”. And it’s pruning, not rotation, which is very good.
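A pruning policy then reads like an actual retention policy rather than a rotation schedule, for example (numbers purely illustrative, same made-up repository as above):

    # Keep everything from the last 2 days, then thin out progressively
    borg prune --keep-within 2d --keep-daily 7 --keep-weekly 4 --keep-monthly 12 \
        ssh://backup@nas/srv/borg/repo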

Add on top the better handling of multiple filesystems/areas to be backed up, all in a single repository, and at first glance everything looks good. But a deeper look made me worried about a few things.

Reliability: On one hand, the archives are mountable. Which seems fancy. But it also means that without the tool working, and the metadata in good shape, you can’t access the data. A quick look at the design shows significant complexity, and thus likely bugs, in the whole archive/database/chunk handling. If this were the only way to get space-efficient storage, all would be good; but if you’re willing to give up encryption (at least for local backups this can be an acceptable trade-off), then rsnapshot plus a tool like duperemove, which can do block-based deduplication (yes, it will kill performance on HDDs), seems a much simpler way to get the same result, without the potential “your repository consists of opaque blobs” problem.
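The rsnapshot-plus-deduplication combination I have in mind would be roughly the following (assuming the backup filesystem supports block-level deduplication via reflinks, e.g. btrfs or XFS; the path is made up):

    # After the rsnapshot run, deduplicate identical blocks across snapshots.
    # -r: recurse, -d: actually submit dedup requests to the kernel,
    # -h: human-readable sizes. This will hammer rotating disks.
    duperemove -rdh /snapshots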

Of course, given that there is significant internal state, there are tools to support it, like borg check and borg recreate, but the existence of these tools in itself confirms to me that there’s an inherent risk in such a design. An rsnapshot directory can be deleted, but it’s hard to get it corrupted.

Speaking of mounting archives: it also means that getting to your files from a few hours ago is not as trivial as in rsnapshot’s case, where it’s simply cp /snapshots/hourly.3/desired/path/file ., with no mounting, no need to come up with the right permissions to allow unprivileged users to do it, etc.

Security: The promise of isolating clients from bad servers and vice versa is good indeed. But it also has a few issues, of which the most important for my use case is the following: in order to allow clients to only push new archives, but not delete/break old ones (i.e. append-only mode), one can set a per-connection append-only mode (via an SSH forced command): you just need to pass --append-only for that client. The documentation gives a nice example, but it ends with:

As data is only appended, and nothing removed, commands like prune or delete won’t free disk space, they merely tag data as deleted in a new transaction.

Be aware that as soon as you write to the repo in non-append-only mode (e.g. prune, delete or create archives from an admin machine), it will remove the deleted objects permanently (including the ones that were already marked as deleted, but not removed, in append-only mode).

So basically, the append-only mode is not “reject other actions” (and ideally alert on them), but rather “postpone modifications until later”, which makes it IMHO useless.
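For completeness, the per-client restriction itself lives in the server’s authorized_keys as a forced command, along these lines (key shortened, path made up):

    # ~backup/.ssh/authorized_keys on the backup server
    command="borg serve --append-only --restrict-to-path /srv/borg/client1",restrict ssh-ed25519 AAAA... client1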

Conclusion: borg backup is useful if you want a relatively hands-off solution that works well, but it has corner cases that, depending on your trade-offs, kind of nullify its space-savings advantage. So, not for me.

What would my ideal solution be?

After thinking on it, these are the important trade-offs:

  1. File or block/chunk-based deduplication? Given that block-based deduplication is available natively (at the file-system level), file-based deduplication in the backup tool itself seems preferable for local backups; for remote backups it’s of course different, but then deduplication with encryption is its own story
  2. File storage: native (1:1) or bundled (needing an extraction step)? I would personally take native again, just to ensure I can get access to the files without needing the tool or its internal state to work
  3. Per-file-system or global repository: ideally global, so that different file-systems don’t require separate handling/integration.

This leans more towards an rsnapshot-like solution… And then there are additional bonus points (in random order):

  • facilitating secure periodic snapshots to offline media
  • facilitating secure remote backups on dumb storage (not over SSH!) so that cloud-based backups can be used if desired
  • native support for redundancy in the form of Reed-Solomon error correction, so that a few lost blocks don’t risk losing an entire file
  • ideally good customisation for the retention policy
  • ideally good exclusion rules (i.e. needing to manually add /home/*/.mozilla/cache is not “good”)

That’s a nice list, and from my search, I don’t think anything like that exists.

Which makes me worried that I’ll start another project I won’t have time to properly maintain…

Next steps

Well, at least the next step is to get bigger hard drives for my current backup solution ☺ I’m impressed by the ~64K hours (7+ years) of Power_On_Hours on my current HDDs, and it makes me feel good about choosing the right hardware way back then, but I can now buy hard drives 5× larger or more, which will allow longer retention and more experiments. I was hoping I could retire my HDDs completely and switch to SSDs only, but that’s still too expensive, and nothing can beat the density and price of 10TB+ HDDs…

Comments and suggestions are very welcome! In the meantime, I’m shopping for hardware :-P