Serverless development and feedback loops
With Serverless Dagster Cloud you can develop and deploy Dagster code without setting up either a local development environment or any cloud infrastructure. When you commit a change to GitHub, a GitHub Action builds and deploys your code directly to Dagster Cloud, where you can view and interact with your Dagster objects in the UI.
Initially, we used our standard Docker-based build process for Dagster Cloud Serverless. However, we soon discovered that this made the edit-deploy-run cycle tediously slow. To speed it up, we implemented a system that ships code outside of Docker images. This post describes the problems we analyzed, the solution we settled on, and the trade-offs we made along the way.
The problem with Docker images
When we build Docker images on GitHub and deploy them to Dagster Cloud, each commit takes anywhere from 3 to 5 minutes to show up in the Dagster UI. Serverless developers often make a small change to the code in each iteration, and having to wait upwards of 3 minutes to see the effect of that change gets tiresome very quickly. We analyzed “what happens when you change one line of code and commit” and discovered the following:
Of these steps, the two that take the longest are:
- building a Docker image (60 – 90+ seconds)
- deploying the Docker container (90 seconds)
Let’s look at each of these.
Building Docker images
Some things to note about building a Docker image:
- Docker images are made of multiple layers in a stack, where each layer is built by a subset of the commands in the Dockerfile (see the sketch after this list).
- Each layer is identified by a hash.
- When uploading images to a registry, only the layers not present in the registry (as identified by the hash) are uploaded.
- Rebuilding an image on a GitHub build machine with the GitHub Actions cache pulls all unaffected layers from the cache onto the build machine. Note that even a large set of dependencies that never changes is still copied from the cache to the build machine on every build.
- Docker builds are not deterministic. If you build an image twice from the exact same contents, it may produce a different hash each time. (While not directly relevant, this was an unexpected observation. As a corner case, a freshly built large layer that is identical to a layer already in the registry may still get uploaded as a new layer.)
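To make the layering concrete, here is a minimal Dockerfile sketch of the pattern this implies. It is illustrative only — the base image, file names, and module name are placeholders, not our actual Serverless build. The dependency layer is usually restored from the cache, but the final code layer is invalidated, rebuilt, and re-uploaded on every commit, even for a one-line change.

```dockerfile
# Illustrative sketch -- not the actual Dagster Cloud Serverless Dockerfile.
FROM python:3.10-slim

WORKDIR /opt/dagster/app

# Dependency layer: only invalidated when requirements.txt changes,
# so it is normally restored from the GitHub Actions cache.
COPY requirements.txt .
RUN pip install -r requirements.txt

# Code layer: invalidated by every commit, so it is rebuilt, re-hashed,
# and pushed to the registry even for a one-line change.
COPY . .

# Hypothetical entry point: start a gRPC code server for the user's module.
CMD ["dagster", "api", "grpc", "--host", "0.0.0.0", "--port", "4000", "--module-name", "my_project"]
```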
Launching Docker containers
The main thing to note about launching Docker containers is that we use AWS Fargate, and it takes anywhere from 45 to 90 seconds to provision capacity and boot a container. Fargate does not provide any image caching: launching a new container downloads every layer from the registry onto the freshly provisioned capacity.
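For context, launching one of these containers looks roughly like the boto3 call below. The cluster name, task definition, and network settings are placeholders rather than our real configuration; the point is that the wait at the end is where the 45 to 90 seconds goes, and every image layer is pulled from the registry before the container starts.

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical cluster and task definition names.
response = ecs.run_task(
    cluster="serverless-user-code",
    taskDefinition="user-code-server:3",
    launchType="FARGATE",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
)
task_arn = response["tasks"][0]["taskArn"]

# Waiting for RUNNING is where the 45-90 second provisioning delay shows up;
# Fargate pulls every image layer before the container boots.
ecs.get_waiter("tasks_running").wait(cluster="serverless-user-code", tasks=[task_arn])
```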
Other constraints
After the Docker image is built and launched, we run the user’s code to extract metadata that is displayed in the UI. This is unavoidable and can take anywhere from a couple of seconds to 30 seconds or more, depending on how the metadata is computed (for example, it could connect to a database to read the schema). This code server remains alive and serves metadata requests until a new version of the code is pushed, which then launches a new container.
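As a rough illustration of why metadata extraction can be slow, here is a hypothetical user module (not taken from any real project) whose definitions reflect a database schema. The code server has to import and execute this module, including the live database calls, before the metadata can be served to the UI.

```python
# Hypothetical user code -- shows why extracting metadata can take seconds.
from dagster import Definitions, asset
from sqlalchemy import create_engine, inspect

# Placeholder connection string.
engine = create_engine("postgresql://analytics:secret@db.internal:5432/warehouse")

def make_snapshot_asset(table_name: str):
    @asset(name=f"{table_name}_snapshot")
    def _snapshot():
        ...  # the actual snapshot logic would go here
    return _snapshot

# Reflecting the schema is a live network round trip, so simply loading this
# module on the code server can take several seconds or more.
tables = inspect(engine).get_table_names()

defs = Definitions(assets=[make_snapshot_asset(t) for t in tables])
```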
One key requirement we have is repeatability: we need to be able to redeploy the exact same code and environment multiple times. Using the Docker image hash as an identifier for the code and environment works very well for this requirement.
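As a sketch of what that identification looks like with the Docker SDK for Python (the image tag here is made up), the registry digest is a content hash, so recording it is enough to redeploy the identical image later:

```python
import docker

client = docker.from_env()

# Hypothetical tag standing in for a user code image pushed by the build.
image = client.images.get("registry.example.com/user-code:latest")

# RepoDigests pins the image by content hash, e.g.
# "registry.example.com/user-code@sha256:9f1a...". Recording this digest
# lets us redeploy the exact same code and environment later.
print(image.attrs["RepoDigests"])
```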
Roundup of alternatives
Here are some alternatives we explored and discussed:
- Switch from Fargate to EC2 for faster container launches. This would increase our ops burden, requiring us to pre-provision, monitor, and scale our own cluster, and we would still have the problem of Docker builds being slow.
- Switch to a different Docker build system such as AWS CodeBuild. This would require a lot more implementation work and deeper integration with GitHub. It was unclear if the payoff would be worth it.
- Switch to AWS Lambda for much faster launch times. The Lambda environment comes with its own base images, making it harder to customize when needed. It also imposes a 15-minute limit on execution time, which would require complicated workarounds for longer-running servers.
- Reuse the long-running code server by building and uploading only the changed code to the same server. The challenge here would be implementing the packaging and runtime mechanism to ensure a reliable and repeatable execution environment. We looked into various ways to package and distribute Python environments,