This post is part of a two-part series, of which this is the second.
In the first post, we went a bit into the infrastructure that runs OpenTTD, from BaNaNaS to our main website.
In this post, we will explain a bit about the migration we just did to get to this infrastructure.
After over 2 months of work, I am happy to announce we finished (another) infrastructure migration.
Today is the day I removed the last few DNS entries pointing to AWS’s DNS servers, and I am proud to say that (almost) all traffic is now routed via Cloudflare (and we aren’t even receiving sponsorship to say so).
In this post I want to walk you through why this migration was needed, what the benefits are, and why you might care.
But in short:
- (Much) smaller monthly bill as AWS charges insane amounts for bandwidth.
- Faster download speeds for you (ranging from the in-game content service to downloading the game from our website).
- Easier maintenance of our infrastructure, thanks to Pulumi.
This will be a bit nerdy, so if you like these kinds of things, read on!
History
As with every migration post, it starts with a bit of history.
Just to catch you up to a few months ago.
In 2020 we started the migration from our own dedicated hosts to AWS.
Back then, AWS was offering Open Source credits, which meant we didn’t have to worry so much about money.
Additionally, they had everything we needed, and it seemed like a good fit.
This has been a very good call; not only did I personally have far fewer worries about things like security, stability, etc, we also had the fewest interruptions to date.
More than 99.99% uptime over all our services since then.
This means it has been very unlikely that you couldn’t download the game, use the in-game content service, etc.
Even big companies would be very happy with 99.99%, but given we run this with only a handful of people in their free time … most excellent, if you ask me.
The fun thing is, with such uptime, when we get a report of a problem, it is very likely to be a user problem, not a backend problem.
This is a great spot to be in!
Problems on the horizon
But for more than a year now, there have been some issues that are hard to deal with.
Let’s explain a few in a bit more detail.
AWS’s bandwidth price
The Open Source credits ran out after the first year; and while we hoped (and somewhat assumed) the AWS bandwidth prices would go down over time, they did not.
For those who don’t know, AWS charges 0.09 USD per GB of bandwidth.
Back in ~2000 (yes, 23 years ago), when I worked in datacenters, the cost per 300GB was about 15 euros per month.
So 0.05 euro per GB.
AWS charges almost double that, in 2023.
And although it is understandable that they have some costs, it is a bit absurd.
We already knew this when migrating to AWS, so we made sure to host our most bandwidth hungry subdomain, the in-game content service, on OVH.
On OVH we used (very) cheap VPSes whose only job was to serve these binary files.
And OVH charged us ~10 USD per month for this, regardless of the amount of bandwidth we consumed.
For reference, this is about 6 TB per month.
If we had hosted this on AWS, it would have cost 540 USD per month.
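As a quick back-of-the-envelope check of that number (a sketch, assuming 1 TB = 1000 GB and the 0.09 USD/GB price quoted above):

```python
# Back-of-the-envelope check: what ~6 TB/month would cost at AWS's egress price.
# Assumes 1 TB = 1000 GB; the 0.09 USD/GB figure is the price mentioned above.
monthly_volume_gb = 6 * 1000       # ~6 TB of in-game content downloads per month
aws_price_per_gb = 0.09            # USD per GB of outbound bandwidth
print(monthly_volume_gb * aws_price_per_gb)  # -> 540.0 USD per month
```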
But when you have an infrastructure that is splintered over two parts (one on AWS, one on VPSes), maintaining it becomes a bit tricky.
And, as it turns out, we never (ever) updated those OVH VPSes, not in the 2.5 years they have been running.
Shame on us.
Anyway, we still racked up a pretty decent bill in bandwidth costs on AWS, despite offloading our biggest subdomain.
On average, we pay around 100 USD a month on bandwidth alone.
Apart from the bandwidth bill, it has to be said: AWS has been great.
Their services have very low downtime, if any at all, and VPSes (called EC2 instances) run for months, even years, without any disruptions.
AWS’s CDK (Infrastructure-as-Code)
When migrating to AWS, we also wanted to have our infrastructure defined as code.
This way it is easier for others to understand what is done, where, and how.
It also avoids human mistakes of pressing the wrong button, etc.
After looking around in 2020, there were three options:
- Ansible
- Terraform
- AWS’s CDK
Ansible wasn’t a good fit, for various (personal) reasons.
Terraform was an option, but it only supported their own HCL back then.
What was nice about AWS’s CDK was that it allowed us to write the infrastructure in a language we knew better: Python.
So we went for AWS’s CDK.
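To give an idea of what that looks like, here is a minimal sketch of a CDK stack in Python (illustrative only; the stack and bucket names are made up, and this is not our actual infrastructure code):

```python
# Minimal AWS CDK (v2) stack in Python -- an illustrative sketch, not our real setup.
from aws_cdk import App, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class ExampleStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A single S3 bucket, defined as code instead of clicked together in a console.
        s3.Bucket(self, "ExampleBucket", versioned=True)

app = App()
ExampleStack(app, "ExampleStack")
app.synth()
```

You then run `cdk deploy`, and CDK translates this into CloudFormation and applies it for you.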
And what a terrible decision this has been.
AWS’s CDK on its own is great.
But the biggest problem is: they make a new release every N weeks, and more often than not, it totally breaks your current project.
And staying behind on an older version also really isn’t a possibility, as slowly things start to act weird.
And forget about using that new shiny thing, of course.
In the beginning I tried to keep up with CDK releases … but more and more often I had to spend weeks (yes, weeks) doing an upgrade.
The whole concept of Infrastructure-as-Code is that you don’t have to spend so much time on this.
But, I had to.
And the longer