Proposed Solution for Synchronizing the Contents of Asuran Repositories
Background
The lowest abstraction level asuran provides is a Content Addressable Storage interface where the blobs, hereafter referred to as "Chunks", are keyed by an HMAC of their plaintext. An HMAC is used, rather than a plain hash, since these keys may leak out of the encrypted sections of the repository and be visible in plaintext. asuran thus uses an HMAC with a key (stored securely with the rest of the repository key material) used only for this purpose, and as this key is unique to each repository, these HMAC keys end up being as good as random numbers to any would-be attacker who does not have access to the key material.
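As a rough illustration of this keying scheme, here is a minimal sketch in Rust using HMAC-SHA256 via the `hmac` and `sha2` crates. The function name and algorithm choice are illustrative assumptions; asuran's actual key function and supported HMAC algorithms may differ.

```rust
use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

// Derive the storage key (the "ChunkID") for a chunk from its plaintext.
// `hmac_key` is the secret, repository-unique key stored with the rest of
// the repository key material; without it, the output is indistinguishable
// from random to an attacker.
fn chunk_id(hmac_key: &[u8], plaintext: &[u8]) -> Vec<u8> {
    let mut mac = HmacSha256::new_from_slice(hmac_key)
        .expect("HMAC can take a key of any size");
    mac.update(plaintext);
    mac.finalize().into_bytes().to_vec()
}
```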
But why is leaking the hash of plain-text data a bad thing?
One might, somewhat reasonably, assume at first that leaking a cryptographic hash of the plaintext of your data is no big deal. After all, you shouldn't be able to reverse the contents of a Chunk based on the cryptographic hash of its plaintext, right?
While that is true, an attacker doesn't necessarily need to reverse the hash to extract compromising information.
Asuran's threat model is quite pessimistic, assuming the repository is located on completely untrusted storage, meaning an attacker is assumed to have complete, unlimited, unprotected access to the on-disk repository. To understand how an attacker might extract compromising information (under this threat model) if Asuran were to use plain hashing rather than HMAC keys, let's work through a somewhat contrived example.
Let's assume that your local MediaMegaCorp has recently released a new film on your favorite format, and they have been having some trouble with piracy. A major scene group has released a popular rip of the film, and in a futile effort to purge the internet of this rip, MediaMegaCorp reaches out to local storage providers for help.
Meanwhile, you have purchased a physical copy of the movie and, as you enjoyed it, dutifully (and, for legality's sake, legally) make a backup of it, for when the inevitable happens and time has rendered your physical copy unusable. You notice you happen to have used the same tools the scene group is known to use for producing their rips, and those tools happen to be known for producing byte-for-byte reproducible copies. You are aware of MediaMegaCorp's attempt to wipe the rip off the net, but pay it no mind, as you use LesserAsuran for your backups, an archiver that strives to leak nothing about its repository's contents.
A few days later, MediaMegaCorp approaches your storage provider about removing copies of the rip from their servers. The storage provider, fearing retribution in the form of a massive copyright lawsuit, complies and asks that MediaMegaCorp provide the hash of the file, which they do; we'll call it #🎞.
Your storage provider, savvy to the existence of such tools as LesserAsuran that index encrypted data based on the hash of the plaintext, not only scans their disks for files whose hashes are #🎞, but also scans the disks for the value of #🎞 itself. When they stumble upon your LesserAsuran repository and find #🎞 in its index, they don't need to reverse the hash to know that the film itself is stored in the repository, as they already have the plaintext the hash is tied to.
Asuran effectively prevents this attack by using an HMAC instead of a plain hash function for keying. The use of HMAC ties the output of the key function to something other than just the plaintext of the object; in this case, it ties it to secret key material that either never touches the remote/untrusted storage unencrypted, or never touches it at all. Even if they have the plaintext they are searching for, the attacker cannot determine[1] if your repository contains that blob without also having your secret key material.
How using a unique HMAC key poses problems for syncing data between repositories
While using an HMAC to generate the content keys has obvious security advantages, the fact that it also serves as the key for deduplication poses a problem for practical use: namely, efficiently synchronizing data between repositories.
How I would solve the problem if we used a plain hash
If Asuran used a plain hash instead of an HMAC, synchronizing archives between repositories would be dead simple: make a list of all the chunk keys in a particular archive on the local repository, then interrogate the remote repository to see which chunks it is missing. Once you have that information, you could simply re-encrypt only the missing chunks with the remote's key, as well as the archive structure itself, and send them over to the remote repository for direct storage.
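A sketch of that plain-hash flow, using hypothetical `LocalRepo`/`RemoteRepo` traits invented purely for illustration (they do not correspond to asuran's real API):

```rust
use std::collections::HashSet;

// Hypothetical minimal repository interfaces for illustration only.
trait LocalRepo {
    /// All chunk keys (plain hashes, in this scenario) in one archive.
    fn archive_chunk_keys(&self, archive: &str) -> Vec<[u8; 32]>;
    /// Decrypt and return a chunk's plaintext.
    fn read_plaintext(&self, key: &[u8; 32]) -> Vec<u8>;
}

trait RemoteRepo {
    /// Report which of the given keys the remote does not yet have.
    fn missing(&self, keys: &[[u8; 32]]) -> HashSet<[u8; 32]>;
    /// Encrypt the plaintext with the remote's own key and store it.
    fn store(&mut self, key: [u8; 32], plaintext: &[u8]);
}

// With plain hashes as keys, syncing reduces to a set difference:
// re-encrypt and send only the chunks the remote reports missing.
fn sync_archive(local: &impl LocalRepo, remote: &mut impl RemoteRepo, archive: &str) {
    let keys = local.archive_chunk_keys(archive);
    for key in remote.missing(&keys) {
        let plaintext = local.read_plaintext(&key);
        remote.store(key, &plaintext);
    }
}
```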
Using an HMAC tied to the repository's key material confounds this approach, as each repository will presumably have different HMAC keys, which prevents the efficient detection and sending of only changed/missing chunks. The local repository would, at a minimum, have to locally decrypt all chunks and reprocess them with the remote's key to determine which ones are missing from the remote. While that is a valid strategy, it is not very efficient, especially for the use case of backing up large file systems with infrequent/small changes between snapshots.
A naive, but flawed approach
One obvious approach would be to allow copying the entire key material for a repository into a new one, and only allowing direct sync between repositories sharing key material, falling back to the inefficient "just reprocess everything" approach when this is not the case. While this is certainly a valid approach, and we probably will support doing this, it will not be the default behavior, as it violates asuran's "leak nothing" policy by making it trivial to determine if two repositories contain the same information.
One might think that sharing only the HMAC key, while keeping the other components of the key different between repositories, would be sufficient, as in this case the encrypted bytes of the chunks themselves would still be different on disk, and you would still need to know the secret HMAC key to determine if a repository contains a particular plaintext.
This approach, however, still leaks that two repositories contain the same information, in a less than obvious way, so I consider it unsafe.
To demonstrate this attack, let's posit a future where people share cool stuff they want to archive, but also make available to the public, by hosting public asuran archives and sharing the password amongst trusted members of the community (or even having the public repository be NoEncryption), and you might pull these files into your own asuran archives by a direct pull. Say you are subscribed to a historical document archiving group that you pull from a lot, so as a matter of convenience you clone your own personal asuran repository's HMAC key from that group's public repository. It would probably seem like no big deal, since your local repository is encrypted with a different encryption key, so it shouldn't leak anything anyway.
Let's go for a slightly more insidious example[2] than the last one. Assume you live in a country with a state secrets act that prohibits civilian possession and distribution of certain pieces of information, and an illegal photo of your country's new stealth bomber has been making the rounds on certain parts of the internet. One of the maintainers of your archiving group (in my opinion, rightly so) decides that the photo of the stealth bomber should be preserved, and sees the best way of doing this as sneaking it into one of their regular archive uploads to the group repository. None the wiser, you conduct your normal weekly pull from the public repository, blissfully unaware of the illegal content that has just been so rudely thrust upon you.
Now let's assume that the original uploader has either been caught in the act, has confessed to the crime, or the government has found out through some other means, and that even though the government knows who uploaded the illegal content and when, they still do not have the key to the repository. Even though they could not positively identify which chunks contained the illegal information[3], the government could still use timestamp information and other side channels to make a definitive statement, beyond a reasonable doubt, that if a repository contains all of a specific set of chunk keys, it contains the illegal information. The government could then go to storage providers and require that they scan their disks for the offending sets of chunk IDs, and if your repository contains all of them, then congratulations: you are now, at the very least, on a list.
My proposed solution
Requirements
Based on the above described attacks, any efficient solution to the problem of synchronizing archives between repositories must satisfy the following properties:
- Must not leak plain hashes of plaintext
- Must not share any secret key state between repositories
- Must not require any deep inspection[4] of chunks that are shared between the repositories
- Must require the secret keys of both repositories to determine if they share any information
- Must not have any non-optional storage overhead
- Must still allow synchronization, even with compute overhead, with repositories that do not have any special features enabled
My solution
I propose modifying the manifest API such that each archive entry has an optional pointer to a chunk containing the following struct:
```rust
struct IDMap {
    known_previously: Vec<ChunkID>,
    additional: BiMap<ChunkID, ChunkHash>,
}
```
Where known_previously is a vector of pointers to the heads of all other known IDMap trees at the time of archive creation, and additional is a bijective mapping of HMAC keys to the plain hashes of each chunk that was not previously known. Since the mapping between ChunkID and ChunkHash should be globally bijective[5], it is trivial to walk the entire tree and union these together at run time to construct a complete mapping.
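That walk might look something like the following sketch. The types are simplified stand-ins for asuran's (the real additional field is a BiMap; a plain HashMap is used here for brevity), and the load closure standing in for "fetch and decrypt an IDMap chunk" is an assumption:

```rust
use std::collections::{HashMap, HashSet};

// Simplified stand-ins for asuran's types.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ChunkID([u8; 32]);   // HMAC of the plaintext, under the repo's key
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ChunkHash([u8; 32]); // plain hash of the plaintext

struct IDMap {
    known_previously: Vec<ChunkID>,
    additional: HashMap<ChunkID, ChunkHash>,
}

// Walk the IDMap tree from a set of heads, unioning each `additional`
// map into one complete ChunkID -> ChunkHash mapping.
fn collect_id_map(
    heads: &[ChunkID],
    load: &impl Fn(ChunkID) -> Option<IDMap>, // fetch + decrypt an IDMap chunk
) -> HashMap<ChunkID, ChunkHash> {
    let mut complete = HashMap::new();
    let mut visited = HashSet::new();
    let mut stack: Vec<ChunkID> = heads.to_vec();
    while let Some(head) = stack.pop() {
        if !visited.insert(head) {
            continue; // this subtree has already been walked
        }
        if let Some(map) = load(head) {
            complete.extend(map.additional);
            stack.extend(map.known_previously);
        }
    }
    complete
}
```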
Syncing to a remote repository can then be accomplished by interrogating the remote repository[6] using ChunkHash rather than ChunkID.
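Concretely, once the complete mapping has been constructed on the local side, finding which chunks the remote is missing reduces to membership checks on ChunkHash values obtained from the remote over that secure channel. A sketch, reusing the types from the previous block:

```rust
use std::collections::{HashMap, HashSet};

// Given the local ChunkID -> ChunkHash mapping and the set of plain
// hashes the remote already holds (exchanged over a secure channel),
// select the chunks that need to be re-encrypted and sent.
fn chunks_to_send(
    local_map: &HashMap<ChunkID, ChunkHash>,
    remote_hashes: &HashSet<ChunkHash>,
) -> Vec<ChunkID> {
    local_map
        .iter()
        .filter(|&(_, hash)| !remote_hashes.contains(hash))
        .map(|(id, _)| *id)
        .collect()
}
```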
This satisfies each property as follows:
- Must not leak plain hashes of plaintext: Chunks are encrypted before hitting storage, so the plain hashes are never written in the clear.
- Must not share any secret key state between repositories: No key state needs to be shared at all; the ChunkIDs are converted to a secret-key-agnostic format during interrogation.
- Must not require any deep inspection of chunks that are shared between the repositories: The bijective map between ChunkID and ChunkHash means that determining if a chunk is present in either repository requires only a handful of constant-time HashMap lookups on either end.
- Must require the secret keys of both repositories to determine if they share information: As the plain hashes themselves are encrypted in storage, an attacker would only have access to the HMACs, which will still be different between repositories.
- Must not have any non-optional storage overhead: As the IDMap pointer will be optional, it will be perfectly valid for a repository to just not include this information.
- Must still allow synchronization between repositories that do not have special features enabled: This information can still be recovered at run time through deep chunk inspection, though at the cost of the I/O and compute that takes.
Footnotes
1. Due to the current on-disk storage format, there is still the potential for a chunk-length-based fingerprinting attack. We support a modified version of buzhash that partially mitigates this through the use of a randomized lookup table, and I am still looking into ways to erase knowledge of chunk length from the on-disk format.
2. I am aware that the specifics of this example are grossly implausible, but there is an entire family of just slightly less feasible and a lot more dangerous versions of this attack; this version just serves as an illustration of the basic concepts.
3. Unless the repository was NoEncryption, obviously.
4. i.e., complete decryption.
5. Within a single repository.
6. Over some sort of secure connection: either a TLS-type connection, or just pulling the chunks from the remote repository down and decrypting locally.