Roundup of Data Backup and Archiving Tools

Here is a comparison of various data backup and archiving tools. For background, see my blog post in which I discuss the difference between backup and archiving. In a nutshell, backups are designed to recover from a disaster that you can fairly rapidly detect. Archives are designed to survive for many years, protecting against disasters that impact not only the original equipment but also the original person that created them. That blog post goes into a lot of detail on what makes a good backup or archiving tool.

Comparison table

Let me give you the comparison here, and explain the features and their significance below.

| Feature | backuppc | bacula (community edition) | borg | dar | git-annex |
| --- | --- | --- | --- | --- | --- |
| Storage type | reference-counted file tree | archive files | reference-counted chunk tree | archive files | sym/hard-link file tree by hash; history via git |
| Supports streaming-only (tape, etc) | no | yes | no | yes | no |
| Can save backup to pipe/FIFO | no | yes (FIFO only) | no | yes (pipe and FIFO) | no |
| Asynchronous backups possible | no | no | no | yes | yes |
| Multi-volume support | no | yes | no | yes | yes |
| Single files larger than a volume | no | yes | no | yes | no |
| Individual backup larger than a volume | no | yes | no | yes | yes (with separate repo) |
| Volume identification | n/a | volume label (except for FIFO) | n/a | backup filename + slice number | repo name |
| Backup rotation / pruning | automatic per configured rules | automatic per configured rules | CLI prune call with rules | manual (SaraB/Baras provides CLI with configured rules) | CLI drop call to delete old data |
| Deduplication | file-level | common base only (paid version has more) | block-level | common base only | file-level |
| Compression | zlib at storage; ssh/rsync transport | zlib | lz4, zstd, zlib, lzma | lz4, zstd, zlib, bzip2, lzo, xz | no |
| Can avoid re-compressing | no | no | no | yes (based on extension; configurable) | n/a |
| Binary deltas | at transport, not storage | no | yes | yes | no |
| Supports encryption | no | data only (filenames & EAs unencrypted) | yes (symmetric) | yes (both public key with gpg and symmetric) | with certain special remotes |
| Zero-trust target | no | moderate (risk of forced keys to client) | yes (if targeted by only 1 machine) | yes | with certain special remotes |
| Authentication / verification | no | X.509 RSA file signatures | HMAC-SHA256 | gpg-signed session key, detached sha512, par2; any pipe | secure hashes and signed commits |
| Can directly back up Windows machines | if rsync installed | if agent installed | no | yes | yes (if git installed) |
| Can directly back up *nix machines | if rsync installed | if agent installed | yes | yes | yes (if git installed) |
| Can directly back up Mac machines | if rsync installed | if agent installed | yes | yes | yes (if git installed) |
| Preserves Mac resource forks | no | yes | yes | yes | no |
| Preserves timestamps | yes | yes | yes | yes | no |
| Preserves *nix hard links | yes | yes | no | yes | no |
| Preserves *nix symlinks | yes | yes | yes | yes | no |
| Preserves *nix EAs and ACLs | yes | yes | yes | yes | no |
| Preserves *nix ownership (uid/gid) | yes | yes | yes | yes | no |
| Preserves *nix sparse files | no | yes | simulated | yes | no |
| System model | daemon on storage; pull via rsync | daemons everywhere; pull | CLI | CLI, C++ library, Python library | CLI |
| Network/remote support | backs up systems using rsync+ssh | scheduler; backs up from/to multiple systems | push to remote using ssh+borg | push to remote on any curl backend, SFTP, ssh, or pipe | push to any of numerous special remotes or ssh+git-annex |
| GUI available | native web interface | yes | yes (vorta) | yes (gdar, DarGUI) | limited web interface focusing on synchronization |
| Restoration without using tool | no | no | no | no | file data but not tree updates |
| External runtime dependencies | rsync, Perl | MySQL or PostgreSQL | Python | none | git |
| Standalone binary distribution | no | no, but bls/bextract can be used in emergency | dynamic (includes Python) | dynamic or static for multiple platforms, from author or distro | dynamic (requires external git) |
| Disaster recovery method | mount, reconfigure hosts | bscan to rebuild DB | normal commands | normal commands | normal, but may need to rescan repos |
| Scheduling | internal | internal | external | external or wrapper script | external |
| Supported platforms for storage | local *nix | *nix, Windows, Mac | local or ssh *nix, Mac | local or ssh *nix, Mac, Windows; curl; SFTP | local or ssh *nix, Mac, Windows; special remotes |
| Supports incremental backups | yes | yes | yes | yes | yes |
| Supports decremental backups | yes* | no | yes* | yes | yes* |

Here’s what the different features mean:

Storage type
What the backup storage looks like. backuppc’s reference-counted file tree is a tree on the filesystem, where each file corresponds to a file on the original system. borg’s reference-counted chunk tree stores each deduplicated chunk of file data as a file on the filesystem. Archive files are large files that group multiple files into one. git-annex’s tree is similar to a reference-counted file tree, but achieves that via links. Of these, the archive files provide the most flexible storage, since they don’t even require a filesystem (and can be put on tape directly), while the reference-counted chunk tree is the most space-efficient; see deduplication below.
Supports streaming-only
Can the backup system write to devices that do not support random access? That would include things such as tapes and pipes.
Can save backup to pipe/FIFO
For those that support streaming-only, can they write the backup data to a pipe or FIFO (named pipe)? These could allow them to be, for example, streamed over ssh.
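For example, here is a minimal sketch of streaming a dar backup over ssh; the hostname and paths are placeholders:

```sh
# "-c -" writes the archive to stdout instead of a file.
# The remote name "home.1.dar" (first and only slice) is just a
# convention so dar can find the archive again at restore time.
dar -c - -R /home -z | ssh backuphost 'cat > /backups/home.1.dar'
```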
Asynchronous backups possible
If yes, the system being backed up and the ultimate storage destination do not have to be reachable over the network in real time. This means they support asynchronous communication (such as NNCP or Filespooler), which facilitates things like airgapped backups and sneakernet: using temporary storage on portable devices to transport the data to its ultimate storage host.
Multi-volume support
Whether the backup system supports more than one volume for storing backups. Here a volume means a removable drive, a tape, an optical disc, or something similar. OS or hardware tricks to aggregate drives (eg, RAID) don’t count here.
Single files larger than a volume
If the system supports multiple volumes, whether it can split a single file across multiple volumes.
Individual backup larger than a volume
If the system supports multiple volumes, whether it can split a backup session across multiple volumes.
Volume identification
How a multi-volume-capable system identifies volumes.
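As a sketch of dar’s filename + slice number scheme in practice (the size and names are examples):

```sh
# Split the backup into slices of at most 700 MiB; dar names them
# mybackup.1.dar, mybackup.2.dar, and so on. -p makes dar pause
# between slices so you can swap removable media.
dar -c mybackup -R /home -z -s 700M -p
```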
Deduplication
Whether the backup system can detect duplicate data in the backup set and store it once. Block-level is the most efficient, as it detects common parts of files. File-level will typically hash files. Common base means that you can use a single base backup (eg, an installed OS when you back up multiple machines) and base incrementals on that, and is the least flexible.
Compression
Whether the backup system supports compression, and if so, what kind.
Can avoid re-compressing
As a performance optimization, whether the backup system can avoid re-compressing already-compressed data
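With dar, for instance, this is driven by filename masks; a sketch (the masks are examples):

```sh
# -Z excludes matching filenames from compression, so data that is
# almost certainly already compressed is stored as-is.
dar -c mybackup -R /home -z -Z '*.gz' -Z '*.jpg' -Z '*.mp4'
```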
Binary deltas
Traditional backup systems treat any change in a file, even a single bit, as a reason to store an entirely new copy of that file. Binary deltas store a more efficient representation of the difference, which can be used to bring the previous file to the new state. BackupPC supports binary deltas over the network, but not at storage. borg and dar support binary deltas both over the network and at storage.
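As a sketch of dar’s binary deltas, assuming a dar built with librsync (dar 2.6 or newer):

```sh
# Full backup that also stores rsync-style delta signatures.
dar -c full -R /home -z --delta sig
# Differential backup against the full; files that changed are
# stored as binary deltas rather than complete new copies.
dar -c diff1 -R /home -z -A full --delta sig
```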
Supports encryption
Whether and how the system can generate encrypted backups.
Zero-trust target
If a system supports encryption, whether the host storing the backup data can be prevented from decrypting it. “Yes” is best.
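With borg, for example, keyfile mode keeps the repository key on the client rather than with the repository; the host and path below are placeholders:

```sh
# The key lands in ~/.config/borg/keys on the client, so the
# storage host alone cannot decrypt the backups.
borg init --encryption=keyfile ssh://backuphost/./borgrepo
```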
Authentication / verification
Whether a backup system provides integrated authentication of the backup data. With some, this is integrated with the encryption code and may require encryption (eg, Bacula). With others, such as git-annex, it is totally separate. dar provides two built-in options: --sign which signs the encryption key used for the session, and --hash which computes a SHA-512 hash while writing the archive and writes it to a separate file once the archive is written. It also integrates with par2 to create par2 signatures. Since dar creates archive files like tar does, it can also be used with any other tool that can sign data on disk or on a pipe; for instance, gpg could be used to provide stronger assurances than the built-in --sign.
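A sketch of dar’s built-in options in practice (names are examples):

```sh
# Write a detached SHA-512 checksum file alongside each slice.
dar -c mybackup -R /home -z --hash sha512
# Optionally add a detached gpg signature for stronger assurance.
gpg --detach-sign mybackup.1.dar
```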
Can directly back up … machines
Whether the program can back up machines running certain operating systems without using external helpers (sftp, etc). “*nix” means Unix/Linux/BSD.
Preserves …
Whether the backup system saves and restores given types of metadata. To preserve a hard link, the backup program must, at restore time, hard link together the exact same set of files that were hard linked in the source data, and no others (even if identical by content). Borg’s simulated support for sparse files means that it saves holes as blocks of NULLs at backup time, and can convert blocks of NULLs to holes at extract time. This doesn’t necessarily preserve the exact sparse structure of the original file, but should achieve roughly similar storage gains.
System model
How the system works. backuppc runs a daemon on the system doing the storage, which pulls data from the systems being backed up using rsync. Bacula has a director daemon that performs scheduling and coordination, a storage daemon that runs on the system(s) providing storage, and a file daemon running on systems being backed up; it also requires a PostgreSQL or MySQL database. The CLI tools are typically invoked as command-line commands (which may be run by cron or systemd).
Network/remote support
How it supports having the backup and the source data on different machines. BackupPC can use rsync over ssh. Bacula uses the daemons as noted, which can communicate over a network. borg can push to a remote over ssh, so long as borg itself can be executed on the remote. dar can push to a remote using backends supported by libcurl, or SFTP, or any command that can be piped to. git-annex has a set of special remotes that can be pushed to, though they may not necessarily preserve all metadata.
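For instance, a borg push over ssh might look like this sketch (user, host, and repo path are placeholders):

```sh
# borg must also be installed on the remote end; {now} expands to
# a timestamp for the archive name.
borg create ssh://user@backuphost/./borgrepo::home-{now} /home
```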
GUI available
Whether a graphical interface is available, and what type. Third-party FLOSS projects provide these for borg and dar. BackupPC’s native web interface is its primary interface. git-annex’s assistant provides Dropbox-like synchronization with its web interface, but it doesn’t work well with all of the workflows git-annex makes possible.
Restoration without using tool
Whether you can restore data without using the particular backup tool used to create it. Of these, only git-annex has some support here; it could let you at least access the file data, even if you may wind up with duplicate copies after renames, deletions, etc.
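As a sketch of what that looks like: in a non-bare git-annex repository, annexed files are symlinks into .git/annex/objects, so the raw contents can be copied out with ordinary tools (the destination path is a placeholder):

```sh
# Each stored object is a regular file named by its key; this
# recovers file data, though not the original tree or filenames.
# (GNU cp; /recovered must already exist.)
find .git/annex/objects -type f -exec cp -t /recovered/ {} +
```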
External runtime dependencies
Things that must be present to run the tool. Of these tools, only dar is fully self-contained, and can be built into a statically-linked single binary on *nix platforms that has no external dependencies.
Standalone binary distribution
Whether a self-contained standalone binary is available, and if so, what kind. Borg’s standalone binary is dynamically-linked and includes the Python environment necessary to run it. dar provides both dynamically- and statically-linked binaries for multiple platforms, from the author or distributions; a statically-linked binary is the most portable. git-annex provides a dynamically-linked binary, which also requires git to be installed.
Disaster recovery method
How to recover the data if only the backup volumes survive a disaster. With BackupPC, you install BackupPC on a fresh system, configure the hosts, and can then restore. Bacula would have you make a fresh install, then use bscan to load the information about the volumes into its database. Bacula does provide the bls/bextract commands as well, but their usage is complex and impractical for most. borg and dar would just have you use the same commands as usual, since they don’t require any external configuration. git-annex may need a git annex sync from the repos to reload their status, but otherwise needs nothing special - IF you have saved the git metadata somewhere.
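For the tools needing no external state, recovery really is just the normal commands; a sketch with placeholder paths and archive names:

```sh
# borg: list the archives in the surviving repository, then extract.
borg list /mnt/rescued/borgrepo
borg extract /mnt/rescued/borgrepo::home-2024-01-01
# dar: list an archive's contents, then restore into the current dir.
dar -l /mnt/rescued/mybackup
dar -x /mnt/rescued/mybackup
```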
Scheduling
How the backup system schedules backups. “Internal” means the backup software has a daemon running that does its own scheduling, often with limits on simultaneous backups and such that it can enforce. External means something like cron handles the scheduling.
Supported platforms for storage
Built-in support for backup destinations. “Local” means storage local to the backup software. “ssh” means via ssh to another system running the backup software. Dar supports libcurl destinations (https/ftp/sftp/etc). git-annex has support for special remotes for various targets. Since dar is a pipe-friendly CLI program, it can be combined with others to support a wide variety of schemes; for instance, rclone to cloud. Emulations such as the Windows Subsystem for Linux don’t count as Windows support; here I mean native support.
Supports incremental backups
Whether the backup system supports storing just the changes since the last backup. All systems here do.
Supports decremental backups
Whether the backup system supports storing the most recent backup as a full backup, then deltas running back in time – sort of the opposite of a traditional incremental. backuppc, borg, and git-annex use a storage format that is equally efficient going forwards and backwards, so I rated them each as “yes*”.

Features every program here has

  • Included in the Debian distribution and many others

  • Supports random access efficient enough to extract a single file without reading an entire backup, when the underlying device supports random access

Overview of the tools and analysis

BackupPC

BackupPC is a single-daemon system that backs up remote systems using rsync. This means that network bandwidth is used efficiently. It stores the files in a file-level deduplicated directory tree. It is a simple solution for basic backups of *nix machines when the backups fit on a standard filesystem.

Bacula

Bacula has its heritage in the tape backup world. It supports full backups and incrementals in the traditional sense. It keeps a database (in PostgreSQL or MySQL) of the different volumes and their contents. This is used both to calculate what media is needed for a restore and to implement volume reuse rules (only allowing a full to be overwritten when a more recent full exists, for instance). It is the only tool here to provide automation around many-to-many storage relationships (it can back up many systems to many storage systems) and provides the most sophisticated automation around volume management. On the other hand, it is also the most complex to install and set up, requiring its own daemons on every relevant system, as well as a database server. The complexity of restores may be a problem for decades-long archival, but on the other hand, those making heavy use of removable media may appreciate its flexibility. Its real target is the enterprise market, and a commercial version adds additional features.

Borg

Borg does backups to a filesystem. Borg’s emphasis is on efficiency; of all the tools here, it is the most efficient both over the network and on disk. Its on-disk format is a filesystem tree consisting of deduplicated chunks of files, which can also be compressed. Therefore, if you move a file – even from one machine to another – it need be neither re-transmitted nor stored again, because borg’s deduplication will detect this. By supporting binary deltas, it also efficiently stores changes to files. It is the best solution for very slow network links or situations where storage space is at a premium. On the other hand, it has its own repository format that should ideally have the time-consuming borg check (which can take days) run periodically, and backups can be slow. Borg doesn’t support multiple volumes.
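A minimal sketch of a borg repository’s lifecycle (paths, archive names, and retention rules are examples):

```sh
borg init --encryption=repokey /backups/borgrepo
# Deduplicated, compressed backup; --stats reports how much new
# data actually had to be stored.
borg create --stats --compression zstd /backups/borgrepo::home-{now} /home
# Thin out old archives according to retention rules.
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /backups/borgrepo
# The consistency check mentioned above; this can run a long time.
borg check /backups/borgrepo
```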

Dar

Dar represents a kind of next-generation tar. It is a command-line program that is supremely flexible, offers integrated par2 support, and is designed to integrate well with external tools. I’ve written a lot about dar; my dar page has links to my articles. Of all these tools, dar is the most flexible about storage, since it can be used in a pipeline. It also supports tape drives, with hooks allowing you to run commands to, for instance, operate a changer or have an operator switch tapes. Its isolated catalogs feature makes for efficient tracking of backed-up data without requiring a separate SQL database as Bacula does. You could look at dar as the all-around most flexible option. While it’s not quite as efficient on-disk as borg, and doesn’t have quite Bacula’s level of built-in volume management sophistication, it does pretty well compared to both – and it is also a better tar than tar, a better zip than zip, and the most “Unixy” of all of these due to its ability to be used in pipelines. It can be thought of as a powerful filesystem differ/patcher, or the workhorse of your own backup scripts. It is also the most standalone of all the tools here: it can function as just a single statically-linked binary.
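A sketch of a full-plus-incremental cycle built on isolated catalogs (names and slice size are examples):

```sh
# Full backup, sliced for removable media.
dar -c full -R /home -z -s 4G
# Isolate the catalog: a small file describing the full backup.
dar -C full_cat -A full
# Incremental referencing only the catalog, so the full backup's
# slices need not be present.
dar -c incr1 -R /home -z -A full_cat
```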

git-annex

git-annex isn’t designed as a backup tool at all, but it has a robust feature set that allows it to be used in such a way. It is more of a data-tracking and moving application. Uniquely, if certain care is used, backed-up data can be presented as plain files along with metadata, meaning that a worst-case scenario of a restore by an unrelated person in the future might at least get at your family photos, even if there are 5 copies of each due to renames; using a full git-annex would resolve that situation.
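A sketch of pressing git-annex into backup service (repository names and paths are examples):

```sh
cd ~/photos
git init && git annex init "laptop"
git annex add .
git commit -m "add photos"
# Clone onto a backup drive and register it as a remote.
git clone ~/photos /mnt/backupdrive/photos
(cd /mnt/backupdrive/photos && git annex init "backupdrive")
git remote add backupdrive /mnt/backupdrive/photos
# Sync both the git history and the annexed file contents.
git annex sync --content backupdrive
```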

