Background
The lowest abstraction level asuran provides is a Content Addressable
Storage interface where the blobs, hereafter referred to as "Chunks",
are keyed by an HMAC of their plaintext.
An HMAC is used, rather than a plain hash, since these keys may leak out
of the encrypted sections of the repository and be visible in plaintext.
asuran thus uses an HMAC with a key (stored securely with the rest of
the repository key material) used only for this purpose. As this key is
unique to each repository, the resulting chunk keys end up being as good
as random numbers to any would-be attacker who lacks access to the key
material.
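To make the keying scheme concrete, here is a minimal sketch in Rust. It
uses the standard library's DefaultHasher as a toy, non-cryptographic
stand-in for the HMAC (the real implementation uses a proper keyed MAC);
the point is only that mixing a per-repository secret into the hash makes
chunk keys unrecognizable to anyone without that secret:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for the HMAC: mixes a per-repository secret into the hash
// of the plaintext. NOT cryptographic -- illustrative only.
fn chunk_id(repo_secret: u64, plaintext: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    repo_secret.hash(&mut h);
    plaintext.hash(&mut h);
    h.finish()
}

fn main() {
    let data = b"the same plaintext in two repositories";
    // Different repository secrets yield different keys for the same
    // plaintext, so an attacker holding the plaintext learns nothing.
    assert_ne!(chunk_id(0xA11CE, data), chunk_id(0xB0B, data));
    // Within one repository the keying is deterministic, so
    // deduplication still works.
    assert_eq!(chunk_id(0xA11CE, data), chunk_id(0xA11CE, data));
    println!("ok");
}
```

A plain hash is the degenerate case where every repository effectively
shares the same (empty) secret, which is exactly what enables the attacks
described below.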
But why is leaking the hash of plain-text data a bad thing?
One might assume at first, somewhat reasonably, that leaking a
cryptographic hash of the plaintext of your data is no big deal. After
all, you shouldn't be able to reverse the contents of a Chunk based on
the cryptographic hash of its plaintext, right?
While that is true, an attacker doesn't necessarily need to reverse the
hash to extract compromising information.
Asuran's threat model is pretty pessimistic, assuming the repository is
located on completely untrusted storage, meaning an attacker is assumed
to have complete, unlimited, unprotected access to the on-disk
repository. To understand how an attacker might be able to extract
compromising information (under this threat model) if Asuran were to use
plain hashing rather than HMAC keys, we are going to come up with a bit
of a contrived example.
Let's assume that your local MediaMegaCorp has recently released a new
film on your favorite format, and they have been having some trouble
with piracy. A major scene group has released a popular rip of the film,
and in a futile effort to purge the internet of this rip, they reach out
to local storage providers for help.
Meanwhile, you have purchased a physical copy of the movie, and, as you
enjoyed it, dutifully (and, for legality's sake, legally) make a backup
of it, for when the inevitable happens and time has rendered your
physical copy unusable. You notice you happen to have used the same
tools the scene group is known to use for producing their rips, and those
tools happen to be known for producing byte-for-byte reproducible
copies. You are aware of MediaMegaCorp's attempt to wipe the rip off
the net, but pay it no mind, as you use LesserAsuran
for your backups,
an archiver that strives to leak nothing about its repository's
contents.
A few days later, MediaMegaCorp approaches your storage provider about
removing copies of the rip from their servers. The storage provider,
fearing retribution in the form of a massive copyright lawsuit, complies
and asks that MediaMegaCorp provide the hash of the file, which they
do, we'll call it #🎞.
Your storage provider, savvy to the existence of such tools as
LesserAsuran
that index encrypted data based on the hash of the
plaintext, not only scans their disks for files whose hashes are #🎞, but
also scans the disk for the value of #🎞 itself. When they stumble upon
your LesserAsuran
repository, and find #🎞 in its index, they don't
need to reverse the hash to know that the film itself is stored in the
repository, as they already have the plaintext the hash is tied to.
Asuran
effectively prevents this attack by using an HMAC instead of a
plain hash function for keying. The use of HMAC ties the output of the
key function to something other than just the plaintext of the object,
in this case, it ties it to secret key material that either never
touches the remote/untrusted storage unencrypted, or never touches it at
all. Even if they have the plaintext they are searching for, the
attacker cannot determine whether your repository contains that blob
without also having your secret key material.
How using a unique HMAC key poses problems for syncing data between repositories
While using an HMAC to generate the content keys has obvious security
advantages, the fact that it also serves as the key for deduplication
poses an issue to practical use, namely, efficiently synchronizing data
between repositories.
How I would solve the problem if we used a plain hash
If Asuran
used a plain hash instead of an HMAC, synchronizing archives
between repositories would be dead simple. The process would be as
simple as making a list of all the chunk keys in a particular archive on
the local repository, then interrogating the remote repository to see
which chunks it is missing. Once you have that information you could
simply re-encrypt only the missing chunks with the remote's key, as
well as the archive structure itself, and send them over to the remote
repository for direct storage.
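That hypothetical plain-hash sync reduces to a set difference, since
chunk keys would mean the same thing in every repository. A sketch of
the idea (function and type names here are illustrative, not asuran's
actual API):

```rust
use std::collections::HashSet;

// Under plain hashing, chunk keys are directly comparable across
// repositories, so finding what the remote lacks is a set difference.
fn chunks_to_send<'a>(
    local_archive: &'a HashSet<&'a str>,
    remote_has: &'a HashSet<&'a str>,
) -> Vec<&'a str> {
    local_archive.difference(remote_has).copied().collect()
}

fn main() {
    let local: HashSet<&str> = ["hash_a", "hash_b", "hash_c"].into_iter().collect();
    let remote: HashSet<&str> = ["hash_a"].into_iter().collect();
    let mut missing = chunks_to_send(&local, &remote);
    missing.sort();
    // Only the two chunks the remote is missing get re-encrypted with
    // the remote's key and sent over.
    assert_eq!(missing, vec!["hash_b", "hash_c"]);
    println!("ok");
}
```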
Using an HMAC tied to the repository's key material confounds this
approach, as each repository will presumably have different HMAC keys,
preventing the efficient detection and sending of only changed/missing
chunks. The local repository would, at a minimum, have to locally
decrypt all chunks and reprocess them with the remote's key to
determine which ones are missing from the remote. While this is a valid
strategy, it is not very efficient, especially for the use case of
backing up large file systems with infrequent/small changes between
snapshots.
A naive, but flawed approach
One obvious approach would be to allow copying of the entire key
material for a repository into a new one, and only allowing direct sync
between repositories sharing key material, falling back to the
inefficient "just reprocess everything" approach when this is not the
case. While this is certainly a valid approach, and we probably will
support doing this, it will not be the default behavior, as it violates
asuran's "leak nothing" policy, by making it trivial to determine if
two repositories contain the same information.
One might think that sharing only the HMAC key, while keeping the other
components of the key material different between repositories, would be
sufficient: the encrypted bytes of the chunks themselves would still
differ on disk, and an attacker would still need to know the secret HMAC
key to determine whether a repository contains a particular plaintext.
This approach, however, still leaks that two repositories contain the
same information, in a less than obvious way, so I consider it unsafe.
To demonstrate this attack, let's posit a future where people share cool
stuff they want to archive, but also make available to the public, by
hosting public asuran archives and sharing the password amongst trusted
members of the community (or even having the public repository use
NoEncryption), and you might pull these files into your own asuran
archives by a direct pull. Say you are subscribed to a historical
document archiving group that you pull from a lot, so as a matter of
convenience you clone your own personal asuran repository's HMAC key
from that group's public repository. It would probably seem like no big
deal, since your local repository is encrypted with a different
encryption key, so it shouldn't leak anything anyway.
Let's go for a slightly more insidious example than the last one.
Assume you live in a country with a state secrets act that prohibits
civilian possession and distribution of certain pieces of information,
and there has been an illegal photo of your country's new stealth
bomber making the rounds on certain parts of the internet. One of the
maintainers of your archiving group (in my opinion, rightly so) decides
that the photo of the stealth bomber should be preserved, and sees the
best way of doing this as sneaking it into one of their regular archive
uploads to the group repository. None the wiser, you conduct your normal
weekly pull from the public repository, blissfully unaware of the
illegal content that has just been so rudely thrust upon you.
Now let's assume that the original uploader has either been caught in the
act, or confessed to his crime, or the government has found out through
some other means, and that even though the government knows who uploaded
the illegal content and when, they still do not have the key to the
repository. Even though they could not positively identify which chunks
contained the illegal information, the government could still use
timestamp information and other side channels to make a definitive statement
beyond a reasonable doubt that if a repository contains all of a
specific set of chunk keys, it contains the illegal information. The
government could then go to storage providers and require that they scan
their disks for the offending sets of chunk ids, and if your repository
contains all of them, then congratulations: you are now, at the very
least, on a list.
My proposed solution
Requirements
Based on the above described attacks, any efficient solution to the
problem of synchronizing archives between repositories must satisfy the
following properties:
- Must not leak plain hashes of plaintext
- Must not share any secret key state between repositories
- Must not require any deep inspection of chunks that are shared
between the repositories
- Must require the secret keys of both repositories to determine if
they share any information
- Must not have any non-optional storage overhead
- Must still allow synchronization, even with compute overhead, with
repositories that do not have any special features enabled
My solution
I propose modifying the manifest API such that each archive entry has an
optional pointer to a chunk containing the following struct:
struct IDMap {
    known_previously: Vec<ChunkID>,
    additional: BiMap<ChunkID, ChunkHash>,
}
Where known_previously is a vector of pointers to the heads of all other
known IDMap trees at the time of archive creation, and additional is a
bijective mapping of HMAC keys to the plain hashes of each chunk that
was not previously known. Since the mapping between ChunkID and
ChunkHash should be globally bijective, it is trivial to walk the entire
tree and union these together at run time to construct a complete
mapping.
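A sketch of that walk-and-union step, assuming a flattened store where
the known_previously pointers become slice indices and a plain HashMap
stands in for the BiMap (all names and types here are placeholders for
the sketch):

```rust
use std::collections::HashMap;

type ChunkID = u64;   // HMAC-derived key (placeholder type)
type ChunkHash = u64; // plain hash of the chunk (placeholder type)

// Simplified IDMap: chunk pointers become indices into a slice, and a
// HashMap stands in for the bijective map.
struct IDMap {
    known_previously: Vec<usize>,
    additional: HashMap<ChunkID, ChunkHash>,
}

// Walk the whole IDMap tree from `head`, unioning every `additional`
// map into one complete ChunkID -> ChunkHash mapping.
fn complete_mapping(store: &[IDMap], head: usize) -> HashMap<ChunkID, ChunkHash> {
    let mut out = HashMap::new();
    let mut seen = vec![false; store.len()];
    let mut stack = vec![head];
    while let Some(i) = stack.pop() {
        if seen[i] {
            continue;
        }
        seen[i] = true;
        out.extend(store[i].additional.iter().map(|(id, h)| (*id, *h)));
        stack.extend(store[i].known_previously.iter().copied());
    }
    out
}

fn main() {
    let root = IDMap {
        known_previously: vec![],
        additional: HashMap::from([(1, 101)]),
    };
    let child = IDMap {
        known_previously: vec![0],
        additional: HashMap::from([(2, 202)]),
    };
    let full = complete_mapping(&[root, child], 1);
    assert_eq!(full.len(), 2);
    assert_eq!(full[&1], 101);
    assert_eq!(full[&2], 202);
    println!("ok");
}
```

The `seen` bookkeeping makes the walk robust even when multiple archives
point at the same earlier IDMap, which is expected in practice.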
Syncing to a remote repository can then be accomplished by interrogating
the remote repository using ChunkHash rather than ChunkID.
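The interrogation step might then look something like the following
(names are illustrative): translate the archive's ChunkIDs into
repository-agnostic ChunkHashes via the complete mapping, then report
which hashes the remote does not already store:

```rust
use std::collections::{HashMap, HashSet};

type ChunkID = u64;   // placeholder types for this sketch
type ChunkHash = u64;

// Translate an archive's ChunkIDs into plain hashes via the
// IDMap-derived mapping, then report which hashes the remote lacks.
fn missing_from_remote(
    archive_chunks: &[ChunkID],
    id_to_hash: &HashMap<ChunkID, ChunkHash>,
    remote_hashes: &HashSet<ChunkHash>,
) -> Vec<ChunkHash> {
    archive_chunks
        .iter()
        .filter_map(|id| id_to_hash.get(id).copied())
        .filter(|h| !remote_hashes.contains(h))
        .collect()
}

fn main() {
    let id_to_hash = HashMap::from([(1, 101), (2, 202), (3, 303)]);
    let remote: HashSet<ChunkHash> = HashSet::from([101]);
    let mut missing = missing_from_remote(&[1, 2, 3], &id_to_hash, &remote);
    missing.sort();
    // Chunks 2 and 3 (hashes 202, 303) need re-encryption and transfer.
    assert_eq!(missing, vec![202, 303]);
    println!("ok");
}
```

Note that the ChunkHashes only ever cross the wire during this
interrogation; they are never written to untrusted storage in the clear.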
This satisfies each property as follows:
- Must not leak plain hashes of plaintext: Chunks are encrypted before
  hitting storage, so the plain hashes are never written in the clear.
- Must not share any secret key state between repositories: No secret
  key state needs to be shared; the ChunkIDs are converted to a
  key-agnostic format during interrogation.
- Must not require any deep inspection of chunks that are shared
  between the repositories: The bijective map between ChunkID and
  ChunkHash means that determining whether a chunk is present in either
  repository requires only a handful of constant-time HashMap lookups
  on either end.
- Must require the secret keys of both repositories to determine if
  they share information: As the plain hashes themselves are encrypted
  in storage, an attacker would only have access to the HMACs, which
  will still be different between repositories.
- Must not have any non-optional storage overhead: As the IDMap pointer
  will be optional, it will be perfectly valid for a repository to
  simply not include this information.
- Must still allow synchronization between repositories that do not
  have special features enabled: This information can still be
  recovered at run time through deep chunk inspection, though at the
  cost of the I/O and compute that takes.