“Convergent encryption” allows for end-to-end encryption with server-side deduplication across all users’ data.
The first time I heard about this, it sounded impossible. End-to-end encryption means I encrypt the file before sending it anywhere. The idea of deduplication is to only store a single copy of each file, but encrypted files from different users should be unique!
Convergent encryption solves this seemingly impossible problem, but it has some major drawbacks that prevent adoption.
graphic by davidzydd
How to “converge” on a single encrypted file
If you already know how convergent encryption works, you can skip this section. But for everyone else, it’s fun to reinvent it. Here are the high-level goals:
- We want end-to-end encryption, which means each user encrypts their files before uploading them to a shared storage service. Only the user (or users!) who encrypted the file should be able to decrypt it.
- We want server-side deduplication. If two users encrypt the same file, the result of their encryption needs to be byte-for-byte identical. The server can then safely store only one copy of the file.
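The second goal is easier to picture from the server's side. Here is a minimal sketch, assuming a hypothetical in-memory `DedupStore` that indexes uploads by the SHA-256 digest of the bytes it receives. The class name and the choice of index are illustrative assumptions, not how any particular storage service works. Uploading the same bytes twice leaves only one stored copy, which is why the encrypted files must match byte for byte.

```python
import hashlib


class DedupStore:
    """Hypothetical blob store that keeps one copy per distinct byte string."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, blob: bytes) -> str:
        # Assumption: the server addresses each blob by the hash of its bytes.
        blob_id = hashlib.sha256(blob).hexdigest()
        # setdefault stores the blob only if this ID has not been seen before.
        self._blobs.setdefault(blob_id, blob)
        return blob_id

    def get(self, blob_id: str) -> bytes:
        return self._blobs[blob_id]

    def __len__(self) -> int:
        return len(self._blobs)


store = DedupStore()
first_id = store.put(b"identical encrypted bytes")
second_id = store.put(b"identical encrypted bytes")  # same ID, no new copy
third_id = store.put(b"different encrypted bytes")   # a second distinct blob
```

The server never needs to read the content. It only compares opaque bytes, so the scheme works as long as different users produce identical ciphertext for identical files.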
For multiple users to encrypt a file and arrive at an identical result, they must use the same encryption key. That key, then, has to be derived from something only those users know. So what do users who possess a given file know that other users don’t?
The answer is the content of the file! The trick is to encrypt the file with the hash of its content:
E(H(m), m)
E(k, m) is a symmetric encryption algorithm such as AES that encrypts message m with key k, and H is a cryptographic hash function like SHA-1.¹
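The idea of encrypting a file with the hash of its own content can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production design. To stay within the Python standard library, the cipher is a simple XOR stream keyed with HMAC-SHA256, standing in for AES, and SHA-256 is used for H. All function names are made up for this example.

```python
import hashlib
import hmac
from typing import Tuple


def derive_key(message: bytes) -> bytes:
    """H(m): the key is the hash of the file content (SHA-256 in this sketch)."""
    return hashlib.sha256(message).digest()


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: HMAC-SHA256(key, counter) blocks, cut to the given length."""
    blocks = []
    for counter in range((length + 31) // 32):
        block = hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        blocks.append(block)
    return b"".join(blocks)[:length]


def encrypt(key: bytes, message: bytes) -> bytes:
    """E(k, m): deterministic XOR stream cipher, a stand-in for AES."""
    stream = _keystream(key, len(message))
    return bytes(a ^ b for a, b in zip(message, stream))


# XOR is its own inverse, so decryption is the same operation as encryption.
decrypt = encrypt


def convergent_encrypt(message: bytes) -> Tuple[bytes, bytes]:
    """Return (key, ciphertext), where ciphertext = E(H(m), m)."""
    key = derive_key(message)
    return key, encrypt(key, message)


# Two users who hold the same file derive the same key and the same ciphertext.
alice_key, alice_ct = convergent_encrypt(b"the same file contents")
bob_key, bob_ct = convergent_encrypt(b"the same file contents")
_, other_ct = convergent_encrypt(b"a different file")
```

Note that the encryption has to be deterministic for this to work: there is no random nonce or IV, because any per-user randomness would make the two ciphertexts differ. Either user can decrypt the shared blob, for example with `decrypt(alice_key, bob_ct)`.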
Any user who has the message/file m will end up with the same encrypted version. Those users will have no trouble decrypting the file later