As far as I've searched, there is no integrated solution for my needs.
I've found software that comes very close, but none of it provides all the features I need for my backup solution:
Typically, one would compile Frost by issuing a "make" in the source folder. Once the build succeeds, everything is explained by issuing "./frost --help".
Frost backs up the given directory to another directory that will contain the encrypted and compressed data (called multichunks).
If you need to save your data to a remote server, mount the server with FUSE on the destination directory, and the backup transparently becomes off-site.
Frost requires or creates these specific files:
Once built and installed, you should run a self-test to ensure the code works as expected:
$ Frost --test key && Frost --test roundtrip && Frost --test purge # They should all return "Success"
You can backup a folder like this:
$ Frost --backup /path/to/backup --remote /where/to/store/backup
You can list all revisions of your backups like this:
$ Frost --list --index /where/backup/are/stored
You can get a more detailed file list like this:
$ Frost --filelist --index /where/backup/are/stored
You can restore the last revision of a backup like this:
$ Frost --restore /path/to/restore/to --remote /where/backup/are/stored
You can view a single file (cat) from a backup like this:
$ Frost --cat /path/to/file/to/restore/in/backup --remote /where/backup/are/stored 1>outFile
To purge up to a given revision:
$ Frost --purge #revision# --remote /where/backup/are/stored # drop from first revision up to (including) #revision#
To get more performance out of the system when you have limited upload bandwidth and/or limited storage space on the remote:
# Mount the remote server's filesystem on /mnt/remote before this
$ mount [...]
$ Frost --backup /path/to/backup --remote /mnt/remote --compression bsc
To restore faster when BSC compression was used (this takes more memory):
$ Frost --restore /path/to/restore/to --remote /mnt/remote --cache 100M
To reclaim as much space as possible when purging:
$ Frost --purge revision --remote /mnt/remote --strategy slow --compression bsc
To restore only a specific revision:
$ Frost --restore /path/to/restore/to #revision# --remote /mnt/remote
To backup with a file containing per-line exclusion rules:
$ Frost --backup /path/to/backup --remote /mnt/remote --exclude exclusion.list
To skip compressing already compressed data (like JPG, MP3, etc):
$ Frost --backup /path/to/backup --remote /mnt/remote --entropy 0.965
To compute your data entropy level so you can decide which threshold to use for compressing data:
$ tar -cf shouldCompress.tar /path/to/compressibleFiles; tar -cf shouldNotCompress.tar /path/to/notCompressibleFiles
$ Frost --test entropy shouldCompress.tar 2>&1 | grep '>>>'; Frost --test entropy shouldNotCompress.tar 2>&1 | grep '>>>';
At any time, use --help to get help:
$ Frost --help; Frost --help security; Frost --help regex; Frost --test help
To mount your backup set locally via FUSE:
$ make build/Frostfuse; ./Frostfuse /where/to/mount --remote /mnt/remote [optional: --keyvault /path/to/keyVault]
$ cd /where/to/mount; cd [rev number]; echo "Enjoy!";
A good starting point for further reading is this thesis.
First, the most important feature is deduplication, and it must be done at the source level (before compression and, obviously, before encryption). Relying on an external program to encrypt/decrypt the data is not viable.
The software must split the input data into chunks, and the chunk size cannot be fixed (otherwise a single-byte insertion in a file would change every chunk from the insertion point onward, preventing the deduplication algorithm from finding the similarities).
The software therefore uses a rolling checksum (which proved technically superior to Rabin fingerprinting [1]), in addition to a larger checksum algorithm, to split the input data into chunks at certain "checkpoint" positions. It then stores the index / metadata / checksums for each chunk in an index file that must be kept locally (and backed up remotely). This index file contains the fingerprint of every chunk and the data decomposition (which chunks are used by each file).
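To make the idea concrete, here is a minimal, illustrative content-defined chunker in Python. It is not Frost's actual code: the window size, boundary mask, minimum chunk size and the buzhash-style rolling hash are assumptions chosen for the example; only the principle (boundaries decided by the data itself, plus a strong fingerprint per chunk) matches the description above.

    import hashlib, random

    WINDOW = 48                   # bytes in the rolling window (hypothetical value)
    MASK = (1 << 13) - 1          # ~8 KiB average chunk size (hypothetical value)
    MIN_CHUNK = 128               # avoid degenerate, tiny chunks
    random.seed(0)
    TABLE = [random.getrandbits(32) for _ in range(256)]   # buzhash substitution table

    def rotl(x, n, bits=32):
        n %= bits
        return ((x << n) | (x >> (bits - n))) & ((1 << bits) - 1)

    def chunks(data: bytes):
        """Yield (fingerprint, chunk) pairs; boundaries depend only on local content."""
        start, h = 0, 0
        for i, b in enumerate(data):
            h = rotl(h, 1) ^ TABLE[b]                       # add the new byte
            if i >= WINDOW:
                h ^= rotl(TABLE[data[i - WINDOW]], WINDOW)  # drop the byte leaving the window
            if (h & MASK) == MASK and i + 1 - start >= MIN_CHUNK:
                piece = data[start:i + 1]
                yield hashlib.sha256(piece).hexdigest(), piece
                start = i + 1
        if start < len(data):
            piece = data[start:]
            yield hashlib.sha256(piece).hexdigest(), piece

Because the boundary test only looks at the last WINDOW bytes, inserting one byte into a file only moves the boundaries near the insertion; chunks further away keep the same fingerprints and deduplicate as before.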
Finally, some compression is applied to the chunks (based on the computed entropy level, since there is no point in compressing already compressed data), and encryption is performed before saving the output chunks to the final server.
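As an illustration of the entropy test (Frost's own estimator may differ), a per-byte Shannon entropy normalized to [0, 1] is enough to decide whether a chunk is worth compressing; values close to 1 mean the data already looks random or compressed, which is what the --entropy threshold (e.g. 0.965) filters out:

    import math
    from collections import Counter

    def normalized_entropy(data: bytes) -> float:
        """Shannon entropy per byte, scaled to [0, 1] (8 bits/byte is the maximum)."""
        if not data:
            return 0.0
        total = len(data)
        counts = Counter(data)
        return -sum((c / total) * math.log2(c / total) for c in counts.values()) / 8.0

    def should_compress(data: bytes, threshold: float = 0.965) -> bool:
        # Skip compression for data whose entropy is already above the threshold.
        return normalized_entropy(data) < threshold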
Restoring requires the index file, which links the actual backup hierarchy to the different chunks on the server.
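The following toy sketch shows the role the index plays during a restore; the in-memory dictionaries and function names are hypothetical stand-ins for Frost's on-disk index database:

    import hashlib

    def build_index(files: dict[str, list[bytes]]):
        """files maps a path to its ordered list of chunks; returns (layout, chunk store)."""
        layout, store = {}, {}
        for path, pieces in files.items():
            fingerprints = []
            for piece in pieces:
                fp = hashlib.sha256(piece).hexdigest()
                store[fp] = piece            # identical chunks are stored only once
                fingerprints.append(fp)
            layout[path] = fingerprints
        return layout, store

    def restore_file(path: str, layout, store) -> bytes:
        # Reassemble a file by following its ordered fingerprint list in the index.
        return b"".join(store[fp] for fp in layout[path])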
While Frost does not do any server communication on its own, and instead relies on a FUSE layer to expose the server as a filesystem, it is important to spell out the minimum requirements of the server/client protocol for anyone interested in implementing such a FUSE layer.
The minimum commands to be supported are:
Also, to allow fast operation over an asymmetric link, network communication should be considered carefully. Chunks will typically be small, yet we want to limit the overhead of a single chunk transfer, so chunks are aggregated until the overhead becomes negligible. Frost was designed with this idea, and a FUSE wrapper providing the minimum requirements above will give the best behavior directly.
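A rough sketch of that aggregation step: small chunks are packed into a fixed-size buffer (the multichunk) before compression, encryption and transfer, so the per-transfer overhead is paid once per multichunk rather than once per chunk. The 250 KiB target size here is an assumption for the example, not Frost's actual value:

    MULTICHUNK_TARGET = 250 * 1024   # hypothetical target size

    def aggregate(chunks):
        """chunks: iterable of bytes; yields multichunk buffers ready to compress/encrypt."""
        buffer = bytearray()
        for piece in chunks:
            buffer.extend(piece)
            if len(buffer) >= MULTICHUNK_TARGET:
                yield bytes(buffer)
                buffer = bytearray()
        if buffer:
            yield bytes(buffer)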
512 bits of random data from the system's random source are hashed into a 256-bit master key.
An asymmetric key pair is generated (public and private). The master key is then encrypted with this key pair and saved in the database.
The private key is encrypted with symmetric encryption (AES-256) and saved in the keyvault.
The symmetric encryption is initialized with a key derived from a password the user must provide (using a Password-Based Key Derivation Function with SHA-256 as the hash function); see the sketch below. There is no IV, since ECB is used for a single block anyway.
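A minimal sketch of that derivation step, using PBKDF2-HMAC-SHA256 from Python's standard library; the salt handling and iteration count are assumptions, not Frost's exact parameters:

    import hashlib, os

    def derive_keyvault_key(password: str, salt: bytes = b"") -> tuple:
        """Derive the 256-bit symmetric key protecting the private key in the keyvault."""
        salt = salt or os.urandom(16)                       # hypothetical salt size
        key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                  salt, 200_000, dklen=32)  # hypothetical iteration count
        return key, salt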
After a multichunk is generated, and after it is compressed (compression may be skipped), it is encrypted with a symmetric cipher whose key is unique per multichunk.
The symmetric cipher is used in CTR block mode.
The nonce is derived as SHA256(Multichunk) ^ counter.
The counter is reset for each multichunk (each multichunk is independent of the others).
The key used for the encryption is derived from the asymmetric encryption algorithm (see below).
The key is updated at pseudo-regular intervals, and synchronization points are used to figure out whether the next block is a salt used to update the key or a ciphertext. Typically, a key is built like this (|| denotes concatenation):

    a random Salt (256 bits) is generated
    key = KDF(Salt || MasterKey)
    cipheredChunk = Salt
    for each block in the multichunk:
        nonce = SHA256(Multichunk) ^ counter
        cipheredChunk = cipheredChunk || AES_CTR(key, block, IV = nonce)
        if length(cipheredChunk) > securityThreshold:
            create a new key (see above)
            cipheredChunk = cipheredChunk || newSalt
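The pseudocode above can be modelled with the following sketch. It is only an illustration of the scheme, not Frost's implementation: it assumes AES-256 in CTR mode from the third-party cryptography package, uses SHA-256 as a stand-in for the KDF, and picks an arbitrary rekeying threshold:

    import hashlib, os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    BLOCK = 16                    # AES block size, in bytes
    REKEY_THRESHOLD = 1 << 20     # hypothetical securityThreshold (~1 MiB)

    def encrypt_multichunk(multichunk: bytes, master_key: bytes) -> bytes:
        base = int.from_bytes(hashlib.sha256(multichunk).digest()[:BLOCK], "big")
        out = bytearray()

        def new_key():
            salt = os.urandom(32)                               # 256-bit salt, stored in clear
            out.extend(salt)
            return hashlib.sha256(salt + master_key).digest()   # stand-in for KDF(Salt || MasterKey)

        key, counter, since_rekey = new_key(), 0, 0
        for off in range(0, len(multichunk), BLOCK):
            block = multichunk[off:off + BLOCK]
            nonce = (base ^ counter).to_bytes(BLOCK, "big")     # SHA256(Multichunk) ^ counter
            enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
            out.extend(enc.update(block) + enc.finalize())
            counter += 1
            since_rekey += len(block)
            if since_rekey > REKEY_THRESHOLD:
                key, since_rekey = new_key(), 0
        return bytes(out)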
The initial secret is generated by pure random entropy and is > 256 bits (refer to Key generation above).
An asymmetric algorithm is used to encrypt this initial secret, and the ciphertext is saved in the key vault. When starting Frost, the secret is deciphered using the key pair (and a password).
The secret is hashed into a master key. This master key, called Mkey later on, is used to derive the block encryption keys.
Notice that SHA256(Multichunk) is stored in the clear in the index database, as it is used to assert that the multichunk was not corrupted on the remote side. A lot of randomness is used in the process to make cryptanalysis hard. However, if the system does not provide enough entropy, a pseudo-random generator is used instead. This is how the Linux kernel deals with randomness, and it is, unfortunately, not something to overlook.
Unless you are storing very secret information, this should not be a concern: the same scheme is used in most cryptographic software.
If someone gets hold of the keyvault, they can try to brute-force the password (or use a dictionary of the most commonly used passwords). Do not use a simple password!
The analysis will still be difficult, since there is no obvious "good" result of decryption, but if the attacker knows a big file you have backed up, cryptanalysis can help them figure out whether a tested password works.
This is the default compression library. The compression ratio is poor, yet the compression is fast.
This matters most when purging old data from the backup set, because the data must be decompressed and the remaining data recompressed. If you have limited space (or bandwidth) on your server, you should use the other method.
On the other hand, if you have plenty of space or bandwidth, a quick algorithm will prove faster and the backup process will take less time.
The Block Sorting Compressor (BSC) algorithm must be enabled explicitly. It was written by Ilya Grebnov and integrated into Frost by me.
It works better on large blocks of data. It is faster than LZMA (while also using Lempel-Ziv prediction over a Burrows-Wheeler transform) and produces a better compression ratio.
Unfortunately, it is slower than Zlib and less common, so caution prevails and Zlib stays the default.
Please note that checksums are used everywhere, so any corruption in the data will be detected anyway.
If you have FUSE installed on your system (API version 26 or later), including the headers, then you can build the Frostfuse program by running:
$ make build/Frostfuse
If completed successfully, you should be able to mount your backup set as a FUSE filesystem.
The implementation mounts read-only access to your backup set, organized as follows:
The arguments for this software are:
Frost Fuse specific options:
    --password=password              The password used to decipher the master key [BEWARE OF YOUR BASH HISTORY]; optional
    --remote=/path/to/remote         The path where the remote is stored
    --index=/path/to/index           The path where the index file is stored (if empty, the remote path is used)
    --keyvault=/path/to/keyvaultFile The path to the key vault file (if empty, ~/.frost/keys is used)
The implementation makes use of FUSE's threads. However, although file metadata retrieval should be quite fast (listing, getting statistics, etc.), accessing file content requires deciphering and decompressing the multichunks. This is as fast as the Frost --restore function, but because of the multithreaded access, the multichunk cache may be missed for many chunks when numerous processes access it in parallel.
So far, it remains perfectly usable, without any noticeable slowdown compared to other archive-based filesystems.
File permissions are not enforced inside the backup set, which means anyone with access to the mount point can read any file in the backup set.
Similarly, even though the mount is read-only, files appear with their original attributes (and permissions), so it can be confusing for other software to see files marked as "rw" that it is not allowed to write to.