Jun 21, 2023
Ever since I started using Matrix, I have always wondered what would happen if a
Matrix homeserver got deleted and then recreated, without any data in the
database. Would the federated servers complain a lot about it? Would federated
rooms work once the server was recreated and started federating again? I
never had the chance or time to properly investigate this.
Well, that is until the hardware node that hosted my Matrix database died.
How did it come to this?
I’ve been renting a dedicated server from online.net since 2016, using their
cheapest offering. I then spun up multiple KVM-based virtual machines via
Terraform for my various use cases, while having everything defined as
Infrastructure as Code in a git repo. This is documented in an older post of
mine alongside my IaC repo. You should check them out ;)
I had set up two VMs for Matrix. One was hosting Element-web and the Synapse
Python implementation, and the other one was hosting the Postgres
database. Everything was defined as code via Ansible and, specifically, via the
DebOps collection of roles.
Initially, everything was backed up one way or another. However, over time, the
Matrix server and its database ended up taking more space than I had planned, so
I made two decisions.
- I stopped backing up the local media files uploaded by the users of my
  homeserver.
- I set up a replica for the database on a different hardware host at Hetzner
  and, at the same time, stopped taking weekly snapshots of the database.
I know that replication isn’t a backup. I’ve set up both wal-e and later wal-g in
my professional career as proper Postgres point-in-time-recovery backups, but
this was something I wasn’t willing to do yet for my Matrix setup. At least, not
until I had an object storage solution (like min.io) running at home.
Eventually, the Hetzner server became too old to keep paying for it, and I
upgraded to a newer one, but never set up the replication again. It was always in
the back of my mind, but I never got around to it because other things in life
took priority.
At the end of the day, the Matrix server was only used by me, family, friends
and maybe friends-of-friends. In the extreme case that there was data loss, they
would be understanding. I was running this for fun after all; it wasn’t used
professionally in any way.
Server goes boom
One fine evening, the hardware server and the VMs simply stopped responding to
network packets. This wasn’t unheard of: during the last ~7 years I remember two
times that something happened to the datacenter and the server lost
connectivity, so I figured it was something similar. After 30 minutes of not
seeing an update on the status page of online.net, I tried rebooting the server
via their hardware management interface. I got back an error, so I opened a
support ticket with them.
Within a couple of minutes they had replied and the verdict was in. The server
experienced a hardware failure, but the SSD was fine. However, they informed me
they wouldn’t be able to perform any operations on it, because the server chassis
was shared with other customers and getting access to the disk meant they would
have to power off all the servers in the rack. Initially this sounded a bit
weird to me; it was a dedicated server after all. However, thinking more about
it and having seen how compact and custom the DCs