Personal data storage

These are my data storage notes, targeting primarily personal data backups: regular files (documents, photo and music collections; not databases), of moderate volume, added or edited rarely, with backups managed manually.

General approach

The "3-2-1 rule" for backups suggests to keep at least 3 copies of data, on at least 2 different storage devices, with at least one copy off-site.

The exact requirements and the methods to achieve them may depend on one's threat model: in addition to device failures, bit rot, and unauthorized access by scrapers, one may have to consider fire or flooding, burglaries and robberies, book burning campaigns and censorship with isolation, hardware seizures and imprisonment without the ability to maintain the remaining backups for years, inability (or a limited ability) to acquire replacement storage devices, and even uncommon and hypothetical scenarios, such as a global high-energy EMP.

Considering the information security "CIA" triad (confidentiality, integrity, availability), we need encryption, so that lost or decommissioned drives will not leak personal data (and so that crypto-shredding can be employed); integrity checking, so that we will either read back the data that was written or detect the corruption (and preferably even repair it); and varied but common technologies (hardware interfaces, drivers, filesystems, file formats), so that there will be a good chance that at least some of the backups can be accessed with reasonable effort in various future situations.
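
For instance, crypto-shredding of a LUKS-encrypted device (covered below) comes down to destroying its key slots, assuming no LUKS header backups are kept elsewhere:

# Destroy all key slots; the data becomes unrecoverable afterwards
cryptsetup luksErase /dev/sdXY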

Most of the technologies covered here are usable for both backups and working storage. I prefer more general tools, since they tend to be better maintained, and learning them is usually a more useful time investment than learning specialized backup systems (for those, see Bacula, Borg, restic, DAR). Some of those systems are quite similar to actual file systems (e.g., Borg is), while apparently often lacking error correction codes and redundancy within a single repository, but they may still be suitable for the task; fortunately, in this case variety is preferable, and one can combine them. See also: Debian Reference Manual - 10. Backup and recovery, BackupAndRecovery - Debian Wiki, Synchronization and backup programs - ArchWiki.

Hardware

Reliable computer hardware is desirable to minimize errors and hardware failures: a UPS, ECC memory, and quality hardware (including storage) in general.

External HDDs (or combinations of internal ones and external boxes) are inexpensive and handy for local backups: they can be kept safely disconnected most of the time, and easily plugged into virtually any computer when needed.

USB flash drives seem more suitable for off-site backups, being more robust for physical transfer. Flash memory apparently is not suited for long-term storage without power, though, so it is suggested to power the drives up for at least a few hours per year, letting the controllers do their maintenance.

Optical discs (CD, DVD, Blu-ray) are commonly suggested for archival, though they seem less convenient for updates and for usage in general, and it is not quite clear whether recordable CDs and DVDs ("burned" with a laser and a dye, as opposed to being stamped at a factory) are that long-lasting.

Paper backups may be useful as well, and quite reliable, particularly for texts and images. Acid-free paper should be used for those, and one may get into bookbinding then. Some use QR codes and other two-dimensional barcodes to store arbitrary data on paper (a sketch follows below). As for hardware, one would need a printer and a scanner, though I should investigate this better.
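
As a sketch of the barcode approach, one may use qrencode and zbarimg (from the zbar tools), assuming the data fits into a single code (a few kilobytes at most):

# Encode a small file (e.g., a key) into a QR code image
qrencode -o key.png < key.txt
# After printing and scanning it back, decode the image
zbarimg --raw scanned.png > restored.txt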

One may also consider keeping backup storage devices and related items in a specialized storage shelf, a Faraday cage, or a fire-resistant and/or waterproof safe.

To go further than that, including storage of physical items, one may also look into general archival- and collection-related materials, such as the Preservation Self-Assessment Program.

Backup operating system

I find it useful (for peace of mind, at least) to set up a bootable operating system on at least one of the backup drives, with all the software necessary to read the backups. So there usually is an EFI system partition (ESP), an unencrypted partition for /boot (GRUB 2 can handle encrypted ones, but it would not make much difference), an encrypted partition for the rest of the system (to prevent possible data leaks via cache, for instance, after backups are accessed from it), and a separate encrypted partition for the backup itself.

When installing a system with an installer, on a machine with more than one disk and some existing systems present, the installer would often pick a seemingly random ESP on one of the internal disks, instead of the one on the backup drive. Fixing that may involve booting via the GRUB shell after GRUB fails to find or access its config on the /boot partition, remounting /boot/efi/ to point to the correct drive's ESP (and fixing it in /etc/fstab), and then running grub-install to install GRUB there, as well as removing undesirable directories from the ESPs manually and adjusting boot entries with efibootmgr. Alternatively, one can opt for a more involved manual installation, setting everything properly at once: see, for instance, "Installing Debian GNU/Linux from a Unix/Linux System" and "Full disk encryption, including /boot: Unlocking LUKS devices from GRUB".
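
A rough sketch of the involved commands (the boot entry number here is a placeholder):

# List the current boot entries, with the ESPs they point to
efibootmgr -v
# Remove an undesirable entry
efibootmgr -b 0003 -B
# Reinstall GRUB into the correct (mounted) ESP; --removable uses the
# fallback path, which suits a drive moved between machines
grub-install --target=x86_64-efi --efi-directory=/boot/efi --removable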

Storage setups

I do partitioning with fdisk, mostly because other common tools (or at least their fancy user interfaces) tend to be buggy and/or to hide technical information, neither of which is desirable when partitioning storage devices; fdisk is nice, commonly available, and works well. With the setups described below, it is possible to set LUKS or an encrypted filesystem directly on a block device, without any partitioning, but it may also be desirable to store some public data backups unencrypted, on a separate partition of the same storage device.
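
A typical fdisk run is interactive; roughly:

# Review the current table first
fdisk -l /dev/sdX
# Then partition interactively: g creates a new GPT, n adds a partition,
# t sets its type, p prints the table, w writes the changes
fdisk /dev/sdX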

RAID 1 (or possibly 5 or 6) is nice to set up if there are spare disks, but usually it is not as critical for redundant personal backups as it is, for instance, for a production server.
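
For reference, a minimal mdadm sketch (with hypothetical device names; LUKS or a filesystem then goes on /dev/md0 as on any other block device):

# Create a RAID 1 array out of two partitions
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX1 /dev/sdY1
# Check its state
cat /proc/mdstat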

As of 2021 and for Linux-based systems, some of the common software options are: cryptsetup/LUKS for encryption, dm-integrity for integrity checking, mdadm for RAID, and filesystems with built-in support for some of those features, such as ext4, Btrfs, and ZFS.

Those can be combined, even the ones serving the same purpose: for instance, storing file checksums would not hurt even if the underlying filesystem supports those already. Likewise, it should not hurt to encrypt the more important files (cryptographic keys, passwords) individually, even while storing them on encrypted disks.

Below are notes and command cheatsheets for the setups I use.

LUKS and ext4

This is probably the most basic and widely supported setup for Linux-based systems. cryptsetup only supports authenticated integrity checks (and those are experimental), so there is no CRC, and no recovery from minor errors without RAID. Perhaps dm-integrity can be set separately to use CRC32C, but that would complicate the setup. Or integrity checking can be skipped altogether, since it is experimental anyway, and the initial wiping can slow down the setup considerably (while skipping the wiping easily leads to errors).
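
If dm-integrity is set separately, a sketch may look as follows (integritysetup is part of the cryptsetup project, and crc32c is its default checksum), with LUKS then created on the mapped device:

integritysetup format /dev/sdXY
integritysetup open /dev/sdXY backup-int
# LUKS would then be set on /dev/mapper/backup-int
integritysetup close backup-int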

Initial setup:

# Optionally, add: --type luks2 --integrity hmac-sha256
cryptsetup luksFormat /dev/sdXY
cryptsetup open /dev/sdXY backup2
mkfs.ext4 /dev/mapper/backup2
cryptsetup close backup2
# Create a mount point
mkdir /var/lib/backup2

A typical session (CLI-based, though this is also handled by graphical file managers, such as Thunar):

cryptsetup open /dev/sdXY backup2
mount -t ext4 /dev/mapper/backup2 /var/lib/backup2/
# synchronize backups
umount /var/lib/backup2/
cryptsetup close backup2

When done, in order to safely eject a device, run eject /dev/sdX, or possibly udisksctl power-off -b /dev/sdX.

For RAID with mdadm, see "dm-crypt + dm-integrity + dm-raid = awesome!".

ZFS

ZFS is not modular like LUKS and friends, there are license compatibility issues, and it is rather unusual overall, but apparently it is a good filesystem, containing all the features needed here.

Initial setup:

# Install the ZFS tools
apt install zfsutils-linux
# Find a partition ID
ls -l /dev/disk/by-id/ | grep sda4
# Use that ID to create a single-device pool. The "mirror" keyword
# should be added to set RAID 1.
zpool create tank usb-WD_Elements_...-part4
# Create a mount point
mkdir /var/lib/backup/
# Create an encrypted file system.
# For redundancy within a dataset, add to the command below: -o copies=2
zfs create -o encryption=on -o keyformat=passphrase -o mountpoint=/var/lib/backup tank/backup

ZFS comes with its own mounting and unmounting commands, and if it is to be used from different systems, the pools should be exported and imported (or just force-imported). A typical session, assuming that it is used from different systems:

# List pools available for import
zpool import
# Import the pool
zpool import tank
# Load the encryption key and mount the file system
zfs mount -l tank/backup
# (Synchronize backups here)
# Unmount the file system (or it will happen on export)
zfs unmount tank/backup
# Unmount the rest of the pool (also not necessary to do manually)
zfs unmount tank
# Export the pool
zpool export tank
# And eject or udisksctl power-off -b, as mentioned above
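
In addition, ZFS can verify all the stored data against its checksums, repairing it where redundancy (copies=2, a mirror) is available; this is not part of a regular session, but seems worth running once in a while:

# Verify (and possibly repair) everything in the pool
zpool scrub tank
# Inspect the progress and results
zpool status tank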

LUKS with Btrfs

This one is set up with the DUP profile for both metadata and data, adding redundancy, and with sha256 checksums (instead of the default crc32c) to reduce the chances of collisions.

Initial setup:

# LUKS, as with ext4
cryptsetup luksFormat /dev/sdXY
cryptsetup open /dev/sdXY backup
# The file system
mkfs.btrfs --csum sha256 -m dup -d dup -L backup /dev/mapper/backup
cryptsetup close backup
# Create a mount point
mkdir /mnt/backup

A session:

cryptsetup open /dev/sdXY backup
mount -t btrfs /dev/mapper/backup /mnt/backup/
# synchronize backups here
umount /mnt/backup/
cryptsetup close backup
eject /dev/sdX
udisksctl power-off -b /dev/sdX
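
Similarly to ZFS, Btrfs can verify the data against its checksums while the filesystem is mounted, repairing from the DUP copies where possible:

btrfs scrub start /mnt/backup
btrfs scrub status /mnt/backup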

Bit rot

As mentioned above, it is important to be able to detect errors with some integrity checks, but one may also aim for single-device redundancy, for a recovery using that device alone (and a better overall chance of successful data recovery), as well as calculate checksums on top of a filesystem (e.g., for ext4, which does not support those on its own).

For integrity checking with basic checksums, one can use find and sha256sum or similar tools:

# Store checksums
mkdir checksums
find . -type f ! -path './checksums*' -exec sha256sum {} \; \
  > checksums/sha256
# Check them
sha256sum --quiet --check checksums/sha256
# Add new ones
find . -type f -newer checksums/sha256 ! -path './checksums*' \
  -exec sha256sum {} \; >> checksums/sha256

For redundant error correction codes, with the ability to repair, one may employ par2 or dvdisaster (which aims at optical discs), though those may be quite inefficient for collections of files that get updated. There are projects like blockyarchive (blkar), but just as with specialized backup systems, they tend to require their own specialized tools to access the backed up files at all. A software RAID (1, 5, or 6) set on different partitions of the same device is a more time-efficient way to achieve some redundancy within a storage device, though it is less space-efficient, and it protects against different bit rot patterns. ZFS's "copies" parameter and Btrfs's DUP profile (for both data and metadata) do something similar, storing multiple copies of blocks within a dataset.
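
As an example of the former, a minimal par2 session (with placeholder file names):

# Create recovery files with 10% redundancy
par2 create -r10 recovery.par2 *.jpg
# Verify the files later, and repair if damage is found
par2 verify recovery.par2
par2 repair recovery.par2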

Other useful tools

S.M.A.R.T. monitoring and testing can be done with smartmontools, and it is usually supported even by external and older USB drives.
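
For instance (device names are placeholders; -d sat is one of the common pass-through options for USB enclosures):

# Print the overall health, attributes, and error logs
smartctl -a /dev/sdX
# Some USB enclosures require an explicit pass-through option
smartctl -d sat -a /dev/sdX
# Start an extended self-test; its result shows up in the -a output
smartctl -t long /dev/sdX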

I normally use just rsync --archive for the initial backup, then rsync --exclude='lost+found' --archive --verbose --checksum --dry-run --delete to compare backups and for data scrubbing, rerunning it without --dry-run afterwards if everything looks fine.

For data erasure, dd is handy for wiping both disks and partitions (before decommissioning drives, or if there were unencrypted partitions before), e.g.:

dd status=progress if=/dev/urandom of=/dev/sdX bs=1M
dd status=progress if=/dev/urandom of=/dev/sdXY bs=1M

GnuPG is there for individual file encryption, as well as for signing. In some cases it may be useful together with tar and gzip.
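
For instance, to encrypt a directory snapshot symmetrically (GnuPG will prompt for a passphrase; the names are placeholders):

tar cz secrets/ | gpg --symmetric --output secrets.tar.gz.gpg
# And to decrypt and unpack it
gpg --decrypt secrets.tar.gz.gpg | tar xz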

Public data backups

Public data may be useful to back up as well: its regular sources may be censored or blocked by a government, or simply become unavailable because of a technical issue (along with the rest of the Internet, if the issue is near the user). In this case the focus should be on high availability, probably along with integrity, while confidentiality hardly matters (unless the data itself is outlawed). I think even unencrypted NTFS is good enough for this, being easily readable from any common system.

As for the data to back up (and later read) this way, Kiwix (with its OpenZIM archives) is a nice project. Its primary viewer may seem awkward for use in normal circumstances, but apparently it aims to be useful to the general public and in bad circumstances: it provides archives as packages, while the viewer (with versions for every common OS) can also serve those to others in a local network via a web browser. library.kiwix.org provides, among others, indexed archives of Project Gutenberg (about 60,000 public domain books), Wikipedia, Wikibooks, Wikiversity, Wiktionary, Wikisource, ready.gov, WikiHow, various StackExchange projects, Khan Academy, and many smaller bits like ArchWiki, RationalWiki, Explain XKCD (which contains the comics). As of 2022, those would take just 200 to 300 GB, even with images and some non-English versions added.

Other large and legal archives to consider for backing up: Wikimedia Downloads, Complete OSM Data, maybe Debian archive mirroring and other software archives, arXiv and other Open Access sources. If one gets into tape storage, Common Crawl can be considered too. And then there are copyright-infringing but much larger libraries like Library Genesis (blocked in Russia; a trimmed down, txt-only version used to be available at offlineos.com, but apparently not anymore), the-eye.eu books (blocked in Russia), as well as music and movies (particularly long TV series may be good for hoarding; out of nice sci-fi ones, there are Doctor Who, Star Trek, Red Dwarf, Farscape, Lexx, Firefly, Defiance, Battlestar Galactica, Babylon 5, The X-Files, First Wave; plenty more can be found in Wikipedia).

OpenStax provides good and freely available textbooks under the CC BY license, available for download in PDF. See OpenStax GitHub repositories for their CNXML sources and related tools, though in 2024 I found it tricky to build HTML out of those, and then it still was not good enough for printing. LibreTexts is supposed to be similar, though the licensing information is unclear in some cases, some links lead to HTTP 404 errors, and some of the books are quite messy (attempting to embed YouTube videos into PDFs, having every other page filled with listings of undeclared licenses, or with "welcome" messages). While its subdomains (math, phys, etc) geo-block direct requests from Russia, the books are available without proxying via commons.libretexts.org. One can also search for libre book sources on platforms like GitHub, possibly querying for TeX sources: there are occasional seemingly decent and not well-known textbooks, like Introductory Physics: Building Models to Describe Our World.

YouTube videos may be useful to hoard as well: there are many nice ones, including educational channels, and platforms like that seem to be getting blocked quickly when a government tries to block information flows (see censorship of YouTube). At 480p most videos would be watchable and not take much space (perhaps 2 to 5 MB per minute), and one can download them with youtube-dl, e.g.: youtube-dl --download-archive archive.txt -f 'bestvideo[height<=480]+bestaudio/best[height<=480]' 'https://www.youtube.com/c/3blue1brown/videos' (see also: some tricks to avoid throttling). I have collected some video links, including interesting YouTube channels. I think it is best to go after relatively information-dense ones (lectures, online lessons) first, possibly followed by entertainment-education, pop-sci, and documentaries.

Remote backups

When backing up private data to a remote (and usually less trusted) machine, it should be encrypted and verified client-side (so options like plain rsync over SSH are not suitable), while preferably still allowing for incremental backups (so tar with gpg is not suitable in general, either). One can still employ LUKS or ZFS, though, by accessing remote block devices via iSCSI (in particular, tgt and open-iscsi seem to work smoothly on Debian), NBD, or similar protocols, possibly on top of IPsec or WireGuard (though as of 2024, those are blocked in Russia between local and foreign machines), tunnels made with SSH port forwarding, TLS (e.g., with stunnel), or anything else establishing a secure channel, to add encryption and more secure authentication.

A test iSCSI setup example:

# server (192.168.1.2)
apt install tgt
dd if=/dev/zero of=/tmp/iscsi.disk bs=1M count=128
tgtadm --lld iscsi --op new --mode target --tid 1 --targetname iqn:2024-07:com.example:tmp-iscsi.disk
tgtadm --lld iscsi --op show --mode target
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /tmp/iscsi.disk
tgtadm --lld iscsi --op new --mode account --user foo --password bar
tgtadm --lld iscsi --op show --mode account
# Allow access for a given initiator address and name, then unbind it
tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address 192.168.1.3 --initiator-name foo
tgtadm --lld iscsi --op unbind --mode target --tid 1 --initiator-address 192.168.1.3 --initiator-name foo
# Or allow access by the address alone
tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address 192.168.1.3

# client (192.168.1.3)
apt install open-iscsi lsscsi
iscsiadm --mode discovery --type sendtargets --portal 192.168.1.2
iscsiadm --mode node --targetname iqn:2024-07:com.example:tmp-iscsi.disk --portal 192.168.1.2 --login
iscsiadm --mode session --print=1
lsscsi
# a block device is available at this point
iscsiadm --mode node --targetname iqn:2024-07:com.example:tmp-iscsi.disk --portal 192.168.1.2 --logout
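
For a quick secure channel on top of that, SSH port forwarding may be the easiest option; a sketch with a hypothetical host (note that the portal address returned by discovery may still need overriding to point at the tunnel):

# On the client: forward the local iSCSI port to the server over SSH
ssh -N -L 3260:127.0.0.1:3260 user@backup.example.com &
# Then discover and log in via the local end of the tunnel
iscsiadm --mode discovery --type sendtargets --portal 127.0.0.1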

Apart from one's own (or rented) remote machines, such a setup can be used with "backup buddies", exchanging some of your local storage space for someone else's. Sneakernet-based backup buddies (that is, occasionally exchanging storage devices) are a fine and easier option for remote backup storage.

A popular option for remote backups is online services (aka "the cloud" and a few other names), with many people relying on those even in place of local backups, or any local storage (as with music and video streaming, hosted photo albums, password managers, book collections, general document storage), delegating all those worries to somebody else. It seems convenient, but it decreases direct control over the data, and introduces dependencies on the service providers' continued existence and continued acceptable terms of service, on network connectivity to them, and on the ability to transfer payments. In my (possibly unrepresentative) experience, all of those are unreliable, but it may still work as a redundant backup copy for some, particularly in predictable democratic countries, with a reputable service provider. Throw in the rule of law and sensible laws (or some kind of hypothetical anarchist or communist utopia), and one may worry less about keeping some information private, as well as about aiming for long-term isolated backups of public information.

Data sharing

For less private data (perhaps for almost everything but cryptographic keys and passwords, that is, explicit secrets), a good way to preserve it is to share it with others: for instance, pictures from an event or gathering are commonly shared among all the participants, while creative works (particularly books and music) can be shared among people with similar interests or tastes. Everything work-related can be backed up on work machines. And the data that is not private at all, like this very note, or other own creative works under permissive licenses, is generally useful to publish, sharing it even more widely.