Background
The lowest abstraction level asuran provides is a Content Addressable
Storage interface where the blobs, hereafter referred to as "Chunks",
are keyed by an HMAC of their plaintext.
An HMAC is used, rather than a plain hash, since these keys may leak out
of the encrypted sections of the repository and be visible in plaintext.
asuran thus uses an HMAC with a key (stored securely with the rest of
the repository key material) used only for this purpose. As this key is
unique to each repository, the resulting chunk keys end up being as good
as random numbers to any would-be attacker who lacks access to the key
material.
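To make the keying scheme concrete, here is a minimal sketch in Rust. It
uses the standard library's DefaultHasher as a toy, non-cryptographic
stand-in for the HMAC (the real implementation uses a proper keyed MAC);
the point is only that mixing a per-repository secret into the hash makes
chunk keys unrecognizable to anyone without that secret:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for the HMAC: mixes a per-repository secret into the hash
// of the plaintext. NOT cryptographic -- illustrative only.
fn chunk_id(repo_secret: u64, plaintext: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    repo_secret.hash(&mut h);
    plaintext.hash(&mut h);
    h.finish()
}

fn main() {
    let data = b"the same plaintext in two repositories";
    // Different repository secrets yield different keys for the same
    // plaintext, so an attacker holding the plaintext learns nothing.
    assert_ne!(chunk_id(0xA11CE, data), chunk_id(0xB0B, data));
    // Within one repository the keying is deterministic, so
    // deduplication still works.
    assert_eq!(chunk_id(0xA11CE, data), chunk_id(0xA11CE, data));
    println!("ok");
}
```

A plain hash is the degenerate case where every repository effectively
shares the same (empty) secret, which is exactly what enables the attacks
described below.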
But why is leaking the hash of plain-text data a bad thing?
One might assume at first, somewhat reasonably, that leaking a
cryptographic hash of the plaintext of your data is no big deal. After
all, you shouldn't be able to reverse the contents of a Chunk based on
the cryptographic hash of its plaintext, right?
While that is true, an attacker doesn't necessarily need to reverse the
hash to extract compromising information.
Asuran's threat model is pretty pessimistic, assuming the repository is
located on completely untrusted storage, meaning an attacker is assumed
to have complete, unlimited, unprotected access to the on-disk
repository. To understand how an attacker might be able to extract
compromising information (under this threat model) if Asuran were to use
plain hashing rather than HMAC keys, we are going to come up with a bit
of a contrived example.
Let's assume that your local MediaMegaCorp has recently released a new
film on your favorite format, and they have been having some trouble
with piracy. A major scene group has released a popular rip of the film,
and in a futile effort to purge the internet of this rip, they reach out
to local storage providers for help.
Meanwhile, you have purchased a physical copy of the movie, and, as you
enjoyed it, dutifully (and, for legality's sake, legally) make a backup
of it, for when the inevitable happens and time has rendered your
physical copy unusable. You notice you happen to have used the same
tools the scene group is known to use for producing their rips, and those
tools happen to be known for producing byte-for-byte reproducible
copies. You are aware of MediaMegaCorp's attempt to wipe the rip off
the net, but pay it no mind, as you use LesserAsuran
for your backups,
an archiver that strives to leak nothing about its repository's
contents.
A few days later, MediaMegaCorp approaches your storage provider about
removing copies of the rip from their servers. The storage provider,
fearing retribution in the form of a massive copyright lawsuit, complies
and asks that MediaMegaCorp provide the hash of the file, which they
do, we'll call it #🎞.
Your storage provider, savvy to the existence of such tools as
LesserAsuran
that index encrypted data based on the hash of the
plaintext, not only scans their disks for files whose hashes are #🎞, but
also scans the disk for the value of #🎞 itself. When they stumble upon
your LesserAsuran
repository, and find #🎞 in its index, they don't
need to reverse the hash to know that the film itself is stored in the
repository, as they already have the plaintext the hash is tied to.
Asuran
effectively prevents this attack by using an HMAC instead of a
plain hash function for keying. The use of HMAC ties the output of the
key function to something other than just the plaintext of the object,
in this case, it ties it to secret key material that either never
touches the remote/untrusted storage unencrypted, or never touches it at
all. Even if they have the plaintext they are searching for, the
attacker cannot determine whether your repository contains that blob
without also having your secret key material.
How using a unique HMAC key poses problems for syncing data between repositories
While using an HMAC to generate the content keys has obvious security
advantages, the fact that it also serves as the key for deduplication
poses an issue to practical use, namely, efficiently synchronizing data
between repositories.
How I would solve the problem if we used a plain hash
If Asuran
used a plain hash instead of an HMAC, synchronizing archives
between repositories would be dead simple. The process would be as
simple as making a list of all the chunk keys in a particular archive on
the local repository, then interrogating the remote repository to see
which chunks it is missing. Once you have that information you could
simply re-encrypt only the missing chunks with the remote's key, as
well as the archive structure itself, and send them over to the remote
repository for direct storage.
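That hypothetical plain-hash sync reduces to a set difference, since
chunk keys would mean the same thing in every repository. A sketch of
the idea (function and type names here are illustrative, not asuran's
actual API):

```rust
use std::collections::HashSet;

// Under plain hashing, chunk keys are directly comparable across
// repositories, so finding what the remote lacks is a set difference.
fn chunks_to_send<'a>(
    local_archive: &'a HashSet<&'a str>,
    remote_has: &'a HashSet<&'a str>,
) -> Vec<&'a str> {
    local_archive.difference(remote_has).copied().collect()
}

fn main() {
    let local: HashSet<&str> = ["hash_a", "hash_b", "hash_c"].into_iter().collect();
    let remote: HashSet<&str> = ["hash_a"].into_iter().collect();
    let mut missing = chunks_to_send(&local, &remote);
    missing.sort();
    // Only the two chunks the remote is missing get re-encrypted with
    // the remote's key and sent over.
    assert_eq!(missing, vec!["hash_b", "hash_c"]);
    println!("ok");
}
```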
Using an HMAC tied to the repository's key material confounds this
approach, as each repository will presumably have different HMAC keys,
preventing the efficient detection and sending of only changed/missing
chunks. The local repository would, at a minimum, have to locally
decrypt all chunks and reprocess them with the remote's key to
determine which ones are missing from the remote. While this is a valid
strategy, it is not very efficient, especially for the use case of
backing up large file systems with infrequent/small changes between
snapshots.
A naive, but flawed approach
One obvious approach would be to allow copying of the entire key
material for a repository into a new one, and only allowing direct sync
between repositories sharing key material, falling back to the
inefficient "just reprocess everything" approach when this is not the
case. While this is certainly a valid approach, and we probably will
support doing this, it will not be the default behavior, as it violates
asuran's "leak nothing" policy, by making it trivial to determine if
two repositories contain the same information.
One might think that sharing only the HMAC key, while keeping the other
components of the key material different between repositories, would be
sufficient: the encrypted bytes of the chunks themselves would still
differ on disk, and an attacker would still need to know the secret HMAC
key to determine whether a repository contains a particular plaintext.
This approach, however, still leaks that two repositories contain the
same information, in a less than obvious way, so I consider it unsafe.
To demonstrate this attack, let's posit a future where people share cool
stuff they want to archive, but also make available to the public, by
hosting public asuran archives and sharing the password amongst trusted
members of the community (or even having the public repository use
NoEncryption), and you might pull these files into your own asuran
archives by a direct pull. Say you are subscribed to a historical
document archiving group that you pull from a lot, so as a matter of
convenience you clone your own personal asuran repository's HMAC key
from that group's public repository. It would probably seem like no big
deal, since your local repository is encrypted with a different
encryption key, so it shouldn't leak anything anyway.
Let's go for a slightly more insidious example than the last one.
Assume you live in a country with a state secrets act that prohibits
civilian possession and distribution of certain pieces of information,
and there has been an illegal photo of your country's new stealth
bomber making the rounds on certain parts of the internet. One of the
maintainers of your archiving group (in my opinion, rightly so) decides
that the photo of the stealth bomber should be preserved, and sees the
best way of doing this as sneaking it into one of their regular archive
uploads to the group repository. None the wiser, you conduct your normal
weekly pull from the public repository, blissfully unaware of the
illegal content that has just been so rudely thrust upon you.
Now let's assume that the original uploader has either been caught in the
act, or confessed to his crime, or the government has found out through
some other means, and that even though the government knows who uploaded
the illegal content and when, they still do not have the key to the
repository. Even though they could not positively identify which chunks
contained the illegal information, the government could still use
timestamp information and other side channels to make a definitive statement
beyond a reasonable doubt that if a repository contains all of a
specific set of chunk keys, it contains the illegal information. The
government could then go to storage providers and require that they scan
their disks for the offending sets of chunk ids, and if your repository
contains all of them, then congratulations: you are now, at the very
least, on a list.
My proposed solution
Requirements
Based on the above described attacks, any efficient solution to the
problem of synchronizing archives between repositories must satisfy the
following properties:
- Must not leak plain hashes of plaintext
- Must not share any secret key state between repositories
- Must not require any deep inspection of chunks that are shared
between the repositories
- Must require the secret keys of both repositories to determine if
they share any information
- Must not have any non-optional storage overhead
- Must still allow synchronization, even with compute overhead, with
repositories that do not have any special features enabled
My solution
I propose modifying the manifest API such that each archive entry has an
optional pointer to a chunk containing the following struct:
struct IDMap {
    known_previously: Vec<ChunkID>,
    additional: BiMap<ChunkID, ChunkHash>,
}
Where known_previously is a vector of pointers to the heads of all other
known IDMap trees at the time of archive creation, and additional is a
bijective mapping of HMAC keys to the plain hashes of each chunk that
was not previously known. Since the mapping between ChunkID and
ChunkHash should be globally bijective, it is trivial to walk the entire
tree and union these together at run time to construct a complete
mapping.
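A sketch of that walk-and-union step, assuming a flattened store where
the known_previously pointers become slice indices and a plain HashMap
stands in for the BiMap (all names and types here are placeholders for
the sketch):

```rust
use std::collections::HashMap;

type ChunkID = u64;   // HMAC-derived key (placeholder type)
type ChunkHash = u64; // plain hash of the chunk (placeholder type)

// Simplified IDMap: chunk pointers become indices into a slice, and a
// HashMap stands in for the bijective map.
struct IDMap {
    known_previously: Vec<usize>,
    additional: HashMap<ChunkID, ChunkHash>,
}

// Walk the whole IDMap tree from `head`, unioning every `additional`
// map into one complete ChunkID -> ChunkHash mapping.
fn complete_mapping(store: &[IDMap], head: usize) -> HashMap<ChunkID, ChunkHash> {
    let mut out = HashMap::new();
    let mut seen = vec![false; store.len()];
    let mut stack = vec![head];
    while let Some(i) = stack.pop() {
        if seen[i] {
            continue;
        }
        seen[i] = true;
        out.extend(store[i].additional.iter().map(|(id, h)| (*id, *h)));
        stack.extend(store[i].known_previously.iter().copied());
    }
    out
}

fn main() {
    let root = IDMap {
        known_previously: vec![],
        additional: HashMap::from([(1, 101)]),
    };
    let child = IDMap {
        known_previously: vec![0],
        additional: HashMap::from([(2, 202)]),
    };
    let full = complete_mapping(&[root, child], 1);
    assert_eq!(full.len(), 2);
    assert_eq!(full[&1], 101);
    assert_eq!(full[&2], 202);
    println!("ok");
}
```

The `seen` bookkeeping makes the walk robust even when multiple archives
point at the same earlier IDMap, which is expected in practice.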
Syncing to a remote repository can then be accomplished by interrogating
the remote repository using ChunkHash rather than ChunkID.
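The interrogation step might then look something like the following
(names are illustrative): translate the archive's ChunkIDs into
repository-agnostic ChunkHashes via the complete mapping, then report
which hashes the remote does not already store:

```rust
use std::collections::{HashMap, HashSet};

type ChunkID = u64;   // placeholder types for this sketch
type ChunkHash = u64;

// Translate an archive's ChunkIDs into plain hashes via the
// IDMap-derived mapping, then report which hashes the remote lacks.
fn missing_from_remote(
    archive_chunks: &[ChunkID],
    id_to_hash: &HashMap<ChunkID, ChunkHash>,
    remote_hashes: &HashSet<ChunkHash>,
) -> Vec<ChunkHash> {
    archive_chunks
        .iter()
        .filter_map(|id| id_to_hash.get(id).copied())
        .filter(|h| !remote_hashes.contains(h))
        .collect()
}

fn main() {
    let id_to_hash = HashMap::from([(1, 101), (2, 202), (3, 303)]);
    let remote: HashSet<ChunkHash> = HashSet::from([101]);
    let mut missing = missing_from_remote(&[1, 2, 3], &id_to_hash, &remote);
    missing.sort();
    // Chunks 2 and 3 (hashes 202, 303) need re-encryption and transfer.
    assert_eq!(missing, vec![202, 303]);
    println!("ok");
}
```

Note that the ChunkHashes only ever cross the wire during this
interrogation; they are never written to untrusted storage in the clear.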
This satisfies each property as follows:
- Must not leak plain hashes of plaintext: Chunks are encrypted before
  hitting storage, so the plain hashes are never written in the clear.
- Must not share any secret key state between repositories: No secret
  key state needs to be shared; the ChunkIDs are converted to a
  key-agnostic format during interrogation.
- Must not require any deep inspection of chunks that are shared
  between the repositories: The bijective map between ChunkID and
  ChunkHash means that determining whether a chunk is present in either
  repository requires only a handful of constant-time HashMap lookups
  on either end.
- Must require the secret keys of both repositories to determine if
  they share information: As the plain hashes themselves are encrypted
  in storage, an attacker would only have access to the HMACs, which
  will still be different between repositories.
- Must not have any non-optional storage overhead: As the IDMap pointer
  will be optional, it will be perfectly valid for a repository to
  simply not include this information.
- Must still allow synchronization between repositories that do not
  have special features enabled: This information can still be
  recovered at run time through deep chunk inspection, though at the
  cost of the I/O and compute that takes.