Frost backup program


I needed a tool I can rely on to back up my work and personal data.
I'm a bit paranoid, so I don't trust an external company with my personal data (whatever its reputation at a given time).
This tool should provide every feature I need, with the best possible implementation, and be well tested.
I don't want to manage the backup (it should be transparent), and I don't want to deal with incremental / full backups (we are no longer in the 90s).
Yet, the backup should be as small as possible, encrypted, and indexed.

As far as I've searched, there is no integrated solution for my needs.

I've found software that comes close, but each fails to provide all the features I need for my backup solution.

How it works

Typically, one would compile Frost by issuing "make" in the source folder. After a successful build, everything is explained by issuing "./frost --help".

Frost backs up the given directory to another directory that will contain the encrypted and compressed data (called multichunks).

If you need to save your data to a remote server, mount the server with FUSE on the destination directory, and the backup will transparently become off-site.

Frost requires/makes these specific files:

  1. The key vault file. This file is protected by a password; still, it's more secure if it is not sent to the server, but kept on your local filesystem / USB key. This file stores the master key that protects the complete backup set. Losing this file means losing your backup.
    It's usually called "keyVault" on the filesystem, and you can set its full path with the --keyvault option.
  2. The index file. This file is a SQLite3 database used to map the "chunks" back to the initial directory hierarchy. It contains all the backup revisions and the hashes for the data. Storing this file on the remote server gives an attacker the list of the files in the backup, but not the means to access their content.
    It must be called "__index.db" on the filesystem. You can set its folder location with the --index option.
  3. The backup set. These files are the encrypted and compressed multichunks. They store the individual "chunks" of each file being backed up, resulting from the deduplication process. From a practical point of view, it's like an archive split into pseudo fixed-size segments. Losing them means losing (part of) your backup set.
    They are named "Base16EncodingOfMultichunkHash.#" on the filesystem. You can set their folder location with the --remote option.

How to use it

Minimal use case

Once built and installed, you should perform a self-test to ensure the code runs as expected:

$ Frost --test key && Frost --test roundtrip && Frost --test purge # They should all return "Success"

You can backup a folder like this:

$ Frost --backup /path/to/backup --remote /where/to/store/backup

You can list all revisions of your backups like this:

$ Frost --list --index /where/backup/are/stored

You can get a more detailed file list like this:

$ Frost --filelist --index /where/backup/are/stored

You can restore the last revision of a backup like this:

$ Frost --restore /path/to/restore/to --remote /where/backup/are/stored

You can view a single file (cat) from a backup like this:

$ Frost --cat /path/to/file/to/restore/in/backup --remote /where/backup/are/stored 1>outFile

To purge up to a given revision:

$ Frost --purge #revision# --remote /where/backup/are/stored # drops from the first revision up to (and including) #revision#

Advanced guide

To get more out of the system if you have limited upload bandwidth and/or limited storage space on the remote:

# Mount the remote server's filesystem on /mnt/remote before this
$ mount [...]
$ Frost --backup /path/to/backup --remote /mnt/remote --compression bsc

To restore faster when BSC compression was used (this takes more memory):

$ Frost --restore /path/to/restore/to --remote /mnt/remote --cache 100M

To purge cleanly as much as possible:

$ Frost --purge revision --remote /mnt/remote --strategy slow --compression bsc

To restore only a specific revision:

$ Frost --restore /path/to/restore/to #revision# --remote /mnt/remote

To backup with a file containing per-line exclusion rules:

$ Frost --backup /path/to/backup --remote /mnt/remote --exclude exclusion.list

To skip compressing already compressed data (like JPG, MP3, etc):

$ Frost --backup /path/to/backup --remote /mnt/remote --entropy 0.965

To compute your data entropy level so you can decide which threshold to use for compressing data:

$ tar -cf shouldCompress.tar /path/to/compressibleFiles; tar -cf shouldNotCompress.tar /path/to/notCompressibleFiles
$ Frost --test entropy shouldCompress.tar 2>&1 | grep '>>>'; Frost --test entropy shouldNotCompress.tar 2>&1 | grep '>>>';
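If you want to see what such a threshold means in practice, the following Python sketch computes the normalized Shannon entropy of a file, byte-wise (0 = perfectly regular, 1 = incompressible noise). It only illustrates the concept; Frost's internal entropy estimator may compute the value differently, so calibrate your threshold with the Frost --test entropy command above.

import math, sys
from collections import Counter

def byte_entropy(path):
    # Shannon entropy per byte, normalized to [0, 1]; values near 1.0
    # indicate data that will not compress well (JPG, MP3, archives...).
    data = open(path, "rb").read()
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / 8.0   # 8 bits per byte is the maximum entropy

if __name__ == "__main__":
    print(byte_entropy(sys.argv[1]))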


At any time, use --help to get help:

$ Frost --help; Frost --help security; Frost --help regex; Frost --test help


To mount your backup set locally via FUSE:

$ make build/Frostfuse; ./Frostfuse /where/to/mount --remote /mnt/remote [optional: --keyvault /path/to/keyVault]
$ cd /where/to/mount; cd [rev number]; echo "Enjoy!";


Technical discussion

A good starting point for further reading is this thesis.

First, the most important feature is deduplication, and it must be done at the source level (before compression and, obviously, before encryption). Relying on an external program to encrypt/decrypt the data is not viable.

The software must split the input data into chunks, and the chunk size cannot be fixed (otherwise, a single byte inserted in a file would change every chunk from the insertion point onward, and thus prevent the deduplication algorithm from finding the similarities).

The software therefore uses a rolling checksum (which proved technically superior to Rabin fingerprinting [1]), in addition to a strong checksum algorithm, to split the input data into chunks at "checkpoint" places. It then stores the index / metadata / checksums for each chunk in an index file that needs to be saved locally (and backed up remotely). This index file contains the "fingerprint" of every chunk, and the data decomposition (which chunks are used by each file).
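To make the idea concrete, here is a minimal Python sketch of content-defined chunking. It is illustrative only: the rolling checksum, window size, and boundary mask are simplified placeholders, not Frost's actual algorithm or parameters; only the idea of a strong hash (SHA-256 here) fingerprinting each chunk matches the description above.

import hashlib

WINDOW = 48           # rolling window size (illustrative)
MASK = (1 << 13) - 1  # boundary mask, ~8 KiB average chunk (illustrative)

def chunk_boundaries(data):
    # Yield (start, end) offsets of content-defined chunks. A trivial
    # additive rolling sum stands in for the real rolling checksum; a
    # boundary is declared when the low bits of the rolling value are zero.
    rolling, start = 0, 0
    for i in range(len(data)):
        rolling += data[i]
        if i >= WINDOW:
            rolling -= data[i - WINDOW]   # slide the window
        if i - start >= WINDOW and (rolling & MASK) == 0:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)

def chunk_fingerprints(data):
    # A strong hash identifies each chunk, so identical chunks are stored once.
    return [hashlib.sha256(data[s:e]).hexdigest() for s, e in chunk_boundaries(data)]

Because boundaries depend only on the local content, inserting a byte in the middle of a file only shifts boundaries locally, so most chunk fingerprints (and therefore most of the deduplicated data) stay unchanged.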

Finally, some compression is applied to the chunks (based on the computed entropy level, since there is no point in compressing already-compressed data), and encryption is applied before saving the output to the final server.

Restoring requires the index file, which links the backed-up hierarchy to the different chunks on the server.

While Frost does not do any server communication on its own, but instead relies on a FUSE layer to expose the server as a filesystem, it's important to spell out the minimum requirements on the server / client protocol for anyone interested in implementing a FUSE layer.

The minimum commands to be supported are:

Also, to allow fast operation on an asymmetric link, network communication should be carefully thought out. Typically, chunks will be small, but we want to limit the overhead of each transfer, so chunks are aggregated so that the per-transfer overhead becomes negligible. Frost was designed with this idea, and a FUSE wrapper providing the minimum requirements above will give the best behavior directly.
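As an illustration of that aggregation idea, here is a small Python sketch that packs deduplicated chunks into multichunks of a target size before they are compressed, encrypted, and transferred. The target size constant is a hypothetical value, not Frost's actual multichunk size.

TARGET_MULTICHUNK_SIZE = 4 * 1024 * 1024   # hypothetical target size

def build_multichunks(chunks):
    # Accumulate chunks until the target size is reached, so that each
    # remote transfer carries many chunks and the per-transfer overhead
    # becomes negligible even on an asymmetric link.
    current, size = [], 0
    for chunk in chunks:
        current.append(chunk)
        size += len(chunk)
        if size >= TARGET_MULTICHUNK_SIZE:
            yield b"".join(current)
            current, size = [], 0
    if current:
        yield b"".join(current)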

The implementation

Crypto

OpenSSL is used for the cryptographic algorithms.
AES256 is used for the symmetric algorithms, SHA256 for the 256-bit hashing functions, PBKDF1 for the password-based key derivation, and ECIES (Elliptic Curve Integrated Encryption Scheme) for the asymmetric encryption.

Key generation

512 bits of random data from the system random source are hashed into the 256-bit master key.

An asymmetric key pair is generated (public and private). The master key is then encrypted with this key pair and saved in the database.

The private key is encrypted with symmetric encryption (AES256) and saved in the key vault.

The symmetric encryption is initialized with a key derived from a password the user must provide (using a Password-Based Key Derivation Function with SHA256 as the hash function). There is no IV, as ECB mode is used for a single block anyway.
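As a rough sketch of the key material involved, using only Python's standard library (which ships PBKDF2 rather than the PBKDF1 mentioned above, used here as a stand-in) and leaving the AES256 and ECIES steps as comments; the password, salt size, and iteration count are illustrative values, not Frost's:

import hashlib, os

# Master key: 512 bits of system randomness hashed down to 256 bits.
master_key = hashlib.sha256(os.urandom(64)).digest()

# An asymmetric (ECIES) key pair would be generated here; the master key is
# encrypted with it, and the private key must itself be protected.

# Key protecting the private key, derived from the user's password.
password = b"do not use a simple password"     # hypothetical
salt = os.urandom(16)                          # illustrative salt size
iterations = 100_000                           # illustrative iteration count
vault_key = hashlib.pbkdf2_hmac("sha256", password, salt, iterations)

# The private key would then be encrypted with AES256 under vault_key
# and written to the key vault file.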

Backup process

After a multichunk is generated, and after it's compressed (this step might be skipped), it's encrypted with a symmetric cipher whose key is unique per multichunk.

The symmetric encryption is used in CTR block mode.
The nonce is derived from SHA256(Multichunk) ^ counter.
The counter is reset for each multichunk (each multichunk is independent of the others).
The key used for the encryption is derived from the master key, which is itself protected by the asymmetric encryption algorithm (see below).

The key is updated at pseudo-regular intervals, and synchronization points are used to figure out whether the next block is a salt used to update the key or ciphertext. Typically, a key is built like this (|| denotes concatenation):

 
A random Salt (256 bits) is generated.
key = KDF(Salt || MasterKey)

cipheredChunk = Salt
for each block in the multichunk:
   nonce = SHA256(Multichunk) ^ counter
   cipheredChunk = cipheredChunk || AES_CTR(key, block, IV = nonce)
   counter = counter + 1
   if length(cipheredChunk) > securityThreshold:
       generate a new Salt and derive a new key (see above)
       cipheredChunk = cipheredChunk || newSalt

The initial secret is generated by pure random entropy and is > 256 bits (refer to Key generation above).

An asymmetric algorithm is used to encrypt this initial secret, and the ciphertext is saved in the key vault. When starting Frost, the secret is deciphered using the key vault (and a password).

The secret is hashed into a master key. This master key, called Mkey later on, is used to derive the block encryption keys.

Notice that SHA256(Multichunk) is stored in the clear in the index database, as it's used to assert that the multichunk was not corrupted on the remote side.
This data is used as the initialization vector for the AES encryption. This is considered safe, as an attacker with the index database would still need the master key to decrypt the multichunk.
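A small sketch of that nonce derivation follows, assuming (this is an assumption, not taken from the source) that the SHA-256 digest is truncated to AES's 128-bit block size before the counter is XORed in:

import hashlib

def block_nonce(multichunk, counter):
    # Per-block CTR nonce: SHA256(multichunk), truncated to 16 bytes
    # (the AES block size), XORed with the block counter.
    digest = hashlib.sha256(multichunk).digest()[:16]
    return (int.from_bytes(digest, "big") ^ counter).to_bytes(16, "big")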

Weakness of the algorithm

A lot of randomness is used in the process to ensure that cryptanalysis will be hard. However, if the system does not provide enough entropy, a pseudo-random generator is used. This is how the Linux kernel deals with randomness and, unfortunately, it is not something to overlook.

Unless you are storing very secret information, this is acceptable; the same scheme is used by most cryptographic software.

If someone gets hold of the key vault, they can try to brute-force the password (or use a dictionary of the most commonly used passwords). Do not use a simple password!

The analysis will still be difficult, since there is no obvious "good" result of decryption, but if the attacker knows a big file you have backed up, cryptanalysis can help them figure out whether the tested password is working.

Compression

Zlib

This is the default compression library. The compression ratio is poor, yet the compression is fast.

This matters most when purging old data from the backup set, because purging requires decompressing the data and recompressing what remains. If you have limited space (or bandwidth) on your server, you should use the other method.

On the other hand, if you have plenty of space and bandwidth, a quick algorithm will prove faster and the backup process will take less time.

BSC

The block-sorting compressor (BSC) must be enabled explicitly. It was written by Ilya Grebnov and integrated into Frost by me.

It works better with big blocks of data to compress. It's faster than LZMA (yet it also uses Lempel-Ziv prediction over a Burrows-Wheeler transform) and produces a better compression ratio.

Unfortunately, it's slower than Zlib and less common, so caution kicks in here and Zlib stays the default.

Please note that checksums are used everywhere, so any corruption of the data will be detected anyway.

Data processing schematic


Backup process

  1. Directories: the software scans directories to find the files they contain.
  2. Files: files are cut into chunks whose size is based on the file's content.
  3. Chunks: existing chunks are not saved twice, but the index still references them.
  4. Multichunks: remaining chunks are accumulated into multichunks.
  5. Compression: multichunks are compressed to avoid wasting backup space.
  6. Encryption: compressed multichunks are encrypted and are ready to be saved to disk.
  7. Output files: the output files contain both the multichunks and the index database.

Restore process

  1. Backup files: the index database is analyzed and matched against the backup files.
  2. Multichunks: the multichunks required for the selected revision are verified.
  3. Encryption: the required multichunks are then decrypted.
  4. Compression: the required multichunks are then decompressed.
  5. Chunks: chunks are extracted from the multichunks.
  6. Directories: the required directories are created.
  7. Files: files are created and their content regenerated.

FUSE implementation

Requirements

If you have FUSE installed on your system (API version 26 or later), including the headers, then you can build the Frostfuse program by running:

$ make build/Frostfuse

If completed successfully, you should be able to mount your backup set as a FUSE filesystem.

The implementation mounts read-only access to your backup set, organized as follows: the top level contains one directory per backup revision, and each revision directory exposes the backed-up file hierarchy.

The arguments for this software are:

Frost Fuse specific options:
	--password=password               The password used to decipher the master key [BEWARE OF YOUR BASH HISTORY]; this is optional
	--remote=/path/to/remote          The path where the remote is stored
	--index=/path/to/index            The path where the index file is stored (if empty, the remote path is used)
	--keyvault=/path/to/keyvaultFile  The path to the key vault file (if empty, ~/.frost/keys is used)

Performance

The implementation makes use of FUSE's threads. Yet, although file metadata retrieval should be pretty fast (listing, getting statistics, etc.), accessing file content requires deciphering and decompressing the multichunks. This is as fast as the Frost --restore function, but because of the multithreaded access, the multichunk cache can be missed for many chunks if numerous processes access it in parallel.

So far, it remains perfectly usable, without any noticeable slowdown compared to any other archive-based filesystem.
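One way to picture the multichunk cache mentioned above is a size-bounded LRU keyed by multichunk hash, in the spirit of the --cache option. The Python sketch below only shows the idea; it is not Frost's actual cache implementation.

from collections import OrderedDict

class MultichunkCache:
    # Illustrative LRU cache of decrypted and decompressed multichunks,
    # keyed by multichunk hash and bounded by total size in bytes.
    def __init__(self, max_bytes):
        self.max_bytes, self.used = max_bytes, 0
        self.entries = OrderedDict()   # hash -> raw multichunk bytes

    def get(self, mc_hash):
        data = self.entries.get(mc_hash)
        if data is not None:
            self.entries.move_to_end(mc_hash)   # mark as recently used
        return data

    def put(self, mc_hash, data):
        self.entries[mc_hash] = data
        self.used += len(data)
        while self.used > self.max_bytes:
            _, evicted = self.entries.popitem(last=False)   # evict least recently used
            self.used -= len(evicted)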

Limitations

File permissions are not enforced inside the backup set, which means anyone with access to the mount point can read any file from the backup set.

Similarly, even though the mount is read-only, files appear with their initial attributes (and permissions), which can be confusing for other software that sees files marked "rw" yet is not allowed to write to them.


Source code for POSIX systems (Linux and Mac OS X).

The solution might be buildable on Windows, with MinGW or the Microsoft compiler, but I have not tested this. In any case, there is no GUI, so the backup would have to be run from a console.