Born from the ashes of Stadia, this repository contains tools for syncing and
streaming files from Windows to Linux. They are based on Content Defined
Chunking (CDC), in particular FastCDC, to split up files into chunks.
History
At Stadia, game developers had access to Linux cloud instances to run games.
Most developers wrote their games on Windows, though. Therefore, they needed a
way to make them available on the remote Linux instance.
As developers had SSH access to those instances, they could use scp to copy
the game content. However, this was impractical, especially with the shift to
working from home during the pandemic with sub-par internet connections. scp
always copies full files: there is no “delta mode” to copy only the things that
changed, it is slow for many small files, and there is no fast compression.
To help this situation, we developed two tools, cdc_rsync and cdc_stream,
which enable developers to quickly iterate on their games without repeatedly
incurring the cost of transmitting dozens of GBs.
CDC RSync
cdc_rsync is a tool to sync files from a Windows machine to a Linux device,
similar to the standard Linux rsync. It is basically a copy tool, but optimized
for the case where there is already an old version of the files available in
the target directory.
- It quickly skips files if timestamp and file size match.
- It uses fast compression for all data transfer.
- If a file changed, it determines which parts changed and only transfers the
differences.
The remote diffing algorithm is based on CDC. In our tests, it is up to 30x
faster than the one used in rsync (1500 MB/s vs 50 MB/s).
The following chart shows a comparison of cdc_rsync and Linux rsync running
under Cygwin on Windows. The test data consists of 58 development builds
of some game provided to us for evaluation purposes. The builds are 40-45 GB
in size. For this experiment, we uploaded the first build, then synced the
second build with each of the two tools and measured the time. For example,
syncing from build 1 to build 2 took 210 seconds with the Cygwin rsync, but
only 75 seconds with cdc_rsync. The three outliers are probably feature drops
from another development branch, where the delta was much higher. Overall,
cdc_rsync syncs files about 3 times faster than Cygwin rsync.
We also ran the experiment with the native Linux rsync, i.e. syncing Linux to
Linux, to rule out issues with Cygwin. Linux rsync performed on average 35%
worse than Cygwin rsync, which can be attributed to CPU differences. We did
not include it in the figure because of this, but you can find it here.
How does it work and why is it faster?
The standard Linux rsync splits a file into fixed-size chunks of typically
several KB.
If the file is modified in the middle, e.g. by inserting xxxx after 567,
this usually means that the modified chunks as well as all subsequent chunks
change.
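The effect is easy to reproduce on a toy input. The following is a minimal sketch, not rsync's actual code: the helper `fixed_chunks`, the 8-byte chunk size, and the sample data are all made up for illustration.

```python
def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

# Hypothetical sample: 64 distinct bytes, then a 4-byte insertion in the middle.
old = bytes(range(64))
new = old[:12] + b"xxxx" + old[12:]

old_chunks = fixed_chunks(old, 8)
new_chunks = fixed_chunks(new, 8)

# A new chunk is reusable only if an identical old chunk exists.
old_set = set(old_chunks)
reusable = [c for c in new_chunks if c in old_set]
print(f"{len(reusable)} of {len(new_chunks)} chunks can be reused")
# prints "1 of 9 chunks can be reused"
```

Only the chunk before the insertion survives. Every chunk after it is shifted by four bytes relative to the old chunk grid, so a naive chunk-by-chunk comparison finds nothing to reuse.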
The standard rsync
algorithm hashes the chunks of the remote “old” file
and sends the hashes to the local device. The local device then figures out
which parts of the “new” file match known chunks.
This is a simplification. The actual algorithm is more complicated and uses
two hashes, a weak rolling hash and a strong hash, see here for a great
overview. What makes rsync relatively slow is the “no match” situation where
the rolling hash does not match any remote hash, and the algorithm has to roll
the hash forward and perform a hash map lookup for each byte. rsync goes to
great lengths to optimize lookups.
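To make the per-byte cost concrete, here is a simplified sketch of the rolling-checksum matching loop. It follows the commonly described form of the rsync weak checksum (two 16-bit sums), but it is not rsync's code: the function names are hypothetical, a direct byte comparison stands in for the strong hash, and a trailing partial block is simply ignored.

```python
M = 1 << 16  # both checksum halves are kept modulo 2^16

def weak_checksum(block: bytes) -> tuple[int, int]:
    """Weak checksum of a block: a plain byte sum and a position-weighted sum."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> tuple[int, int]:
    """Slide an n-byte window forward by one byte in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b

def find_matches(old: bytes, new: bytes, n: int) -> list[tuple[int, int]]:
    """Return (offset in new, offset in old) for every old block found in new."""
    table: dict[tuple[int, int], list[int]] = {}
    for idx in range(0, len(old) - n + 1, n):
        table.setdefault(weak_checksum(old[idx:idx + n]), []).append(idx)

    matches: list[tuple[int, int]] = []
    if len(new) < n:
        return matches
    pos = 0
    a, b = weak_checksum(new[:n])
    while True:
        hit = None
        for idx in table.get((a, b), []):          # one hash map lookup per position
            if old[idx:idx + n] == new[pos:pos + n]:  # stand-in for the strong hash
                hit = idx
                break
        if hit is not None:
            matches.append((pos, hit))
            pos += n                                # jump over the matched block
            if pos + n > len(new):
                break
            a, b = weak_checksum(new[pos:pos + n])
        else:
            if pos + n >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + n], n)
            pos += 1                                # "no match": advance a single byte
    return matches

old = bytes(range(64))
new = old[:12] + b"xxxx" + old[12:]
print(find_matches(old, new, 8))
```

In the "no match" branch the loop advances one byte at a time and performs a lookup at every position, which is the cost described above.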
cdc_rsync does not use fixed-size chunks, but instead variable-size,
content-defined chunks. That means chunk boundaries are determined by the
local content of the file, in practice a 64-byte sliding window. For more
details, see the FastCDC paper or take a look at our implementation.
If the file is modified in the middle, only the modified chunks, but not
subsequent chunks, change (unless they are less than 64 bytes away from the
modifications).
Computing the chunk boundaries is cheap and involves only a left-shift, a memory
lookup, an add and an and operation for each input byte. This is cheaper
than the hash map lookup for the standard rsync algorithm.
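The per-byte work can be sketched with a gear-style rolling hash, the hash family FastCDC builds on. This is an illustrative sketch and not the repository's implementation: the gear table is filled from a seeded random generator, the average chunk size is set unrealistically small (`avg_bits=6`, about 64 bytes), and FastCDC's minimum/maximum chunk sizes and normalized chunking are left out. All names are hypothetical.

```python
import random

MASK64 = (1 << 64) - 1                      # emulate 64-bit unsigned overflow
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # assumed table: one value per byte

def chunk_boundaries(data: bytes, avg_bits: int = 6) -> list[int]:
    """Return the end offset of every chunk; a boundary is set where the
    top `avg_bits` bits of the rolling hash are all zero."""
    mask = ((1 << avg_bits) - 1) << (64 - avg_bits)
    h = 0
    ends = []
    for i, byte in enumerate(data):
        # left-shift, table lookup, add: a byte's contribution is shifted out
        # after 64 steps, so h depends only on the last 64 bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if (h & mask) == 0:                 # the "and" operation
            ends.append(i + 1)
    if data and (not ends or ends[-1] != len(data)):
        ends.append(len(data))              # the last chunk ends at end of file
    return ends

def split(data: bytes, ends: list[int]) -> list[bytes]:
    """Cut data at the given end offsets."""
    starts = [0] + ends[:-1]
    return [data[s:e] for s, e in zip(starts, ends)]

# Hypothetical test data: 16 KiB of pseudo-random bytes with a 4-byte insertion.
data_rng = random.Random(1)
old = bytes(data_rng.getrandbits(8) for _ in range(16384))
pos, ins = 8000, b"xxxx"
new = old[:pos] + ins + old[pos:]

old_set = set(split(old, chunk_boundaries(old)))
new_chunks = split(new, chunk_boundaries(new))
changed = [c for c in new_chunks if c not in old_set]
print(f"{len(changed)} of {len(new_chunks)} chunks changed")
```

Because the hash only sees the last 64 bytes, every boundary that lies at least 64 bytes past the insertion is found again at the same place in the content, so only the chunks around the edit differ. In C or C++ the 64-bit wraparound is implicit; the `& MASK64` is only needed because Python integers are unbounded.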
Because of this, the cdc_rsync algorithm is faster than the standard rsync.
It is also simpler. Since chunk boundaries move along with insertions
or deletions, the task of matching local and remote hashes is a trivial set
difference operation. It does not involve a per-byte hash map lookup.
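A minimal sketch of that set difference, assuming both sides have already been split into content-defined chunks. The names `chunk_id` and `plan_transfer` are hypothetical, and BLAKE2b from Python's standard library is used only as a convenient stand-in for whatever strong hash the tool uses.

```python
import hashlib

def chunk_id(chunk: bytes) -> bytes:
    """Identify a chunk by a strong hash of its content (assumed: 128-bit BLAKE2b)."""
    return hashlib.blake2b(chunk, digest_size=16).digest()

def plan_transfer(local_chunks: list[bytes], remote_ids: set[bytes]):
    """Return (recipe, missing).

    recipe:  ordered chunk ids that describe the new file
    missing: id -> data for chunks the remote side does not have yet
    """
    recipe = []
    missing = {}
    for chunk in local_chunks:
        cid = chunk_id(chunk)        # one lookup per chunk, not per byte
        recipe.append(cid)
        if cid not in remote_ids and cid not in missing:
            missing[cid] = chunk
    return recipe, missing

# Hypothetical example: the remote side already has three chunks.
remote_ids = {chunk_id(c) for c in [b"aaa", b"bbb", b"ccc"]}
recipe, missing = plan_transfer([b"aaa", b"xxx", b"ccc", b"xxx"], remote_ids)
print(len(recipe), list(missing.values()))
# prints "4 [b'xxx']"
```

Only the data in `missing` has to cross the network; the remote side can rebuild the file from the recipe plus the chunks it already holds.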
CDC Stream
cdc_stream is a tool to stream files and directories from a Windows machine to a
Linux device. Conceptually, it is similar to sshfs, but it is optimized for
read speed.
- It caches streamed data on the Linux device.
- If a file is re-read on Linux after it changed on Windows, only the
differences are streamed again. The rest is read from the cache.
- Stat operations are very fast since the directory metadata (filenames,
permissions etc.) is provided in a streaming-friendly way.
To efficiently determine which parts of a file changed, the tool uses the same
CDC-based diffing algorithm as cdc_rsync. Changes to Windows files are almost
immediately reflected on Linux, with a delay of roughly (0.5s + 0.7s x total
size of changed files in GB).
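As a quick worked example of that rule of thumb, the helper below (a hypothetical name, not part of the tools) evaluates the formula for a given amount of changed data:

```python
def expected_delay_seconds(changed_gb: float) -> float:
    """Rough delay until a change shows up on Linux: 0.5 s + 0.7 s per changed GB."""
    return 0.5 + 0.7 * changed_gb

for gb in (0.1, 2.0, 10.0):
    print(f"{gb:5.1f} GB changed -> about {expected_delay_seconds(gb):.2f} s")
```

For example, 2 GB of changed files would become visible after roughly 1.9 seconds.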
The tool does not support writing files back from Linux to Windows; the Linux
directory is read-only.
The following chart compares times from starting a game to reaching the menu.
In one case, the game is streamed via sshfs, in the other case we use
cdc_stream. Overall, we see a 2x to 5x speedup.
Download the precompiled binaries from the
latest release.
We currently provide Linux binaries compiled on
GitHub’s latest Ubuntu version.
If the binaries work for you, you can skip the following two sections.
Alternatively, the project can be built from source. Some binaries have to be
built on Windows, some on Linux.
Prerequisites
To build the tools from source, the following steps have to be executed on
both Windows and Linux.
- Download and install Bazel from here. See workflow logs for the
currently used version.
- Clone the repository.
git clone https://github.com/google/cdc-file-transfer
- Initialize submodules.
cd cdc-file-transfer
git submodule update --init --recursive
Finally, install an SSH client on the Windows device if not present.
The file transfer tools require ssh.exe and scp.exe.
Building
The two tools can be built and used independently.
CDC RSync
- Build Linux components
bazel build --config linux --compilation_mode=opt --linkopt=-Wl,--strip-all --copt=-fdata-sections --copt=-ffunction-sections --linkopt=-Wl,--gc-sections //cdc_rsync_server
- Build Windows components
bazel build --config windows --compilation_mode=opt --copt=/GL //cdc_rsync
- Copy the Linux build output file cdc_rsync_server from
bazel-bin/cdc_rsync_server on the Linux system to bazel-bin\cdc_rsync
on the Windows machine.
CDC Stream
- Build Linux components
bazel build --config linux --compilation_mode=opt --linkopt=-Wl,--strip-all --copt=-fdata-sections --copt=-ffunction-sections --linkopt=-Wl,--gc-sections //cdc_fuse_fs
- Build Windows components
bazel build --config windows --compilation_mode=opt --copt=/GL //cdc_stream
- Copy the Linux build output files cdc_fuse_fs and libfuse.so from
bazel-bin/cdc_fuse_fs on the Linux system to bazel-bin\cdc_stream
on the Windows machine.
Usage
The tools require a setup wher