Published Wednesday, Jan 8, 2025
–
741 words, 4 minutes
Welcome to a new post in my miniseries about some crazy bugs I’ve had to investigate and debug throughout my career.
This story is from about three years ago, at a previous job. I was involved in doing a lift-and-shift migration from on-prem to Google Cloud of a bespoke system we had built for a large insurance company.
The application was a very complicated and interesting distributed system, designed by some former colleagues, that I maintained at the time. It was used to implement part of a custom model for Solvency II compliance, i.e. the EU rules that govern how much money insurances need to keep aside as a liquidity buffer, created after the 2008 financial crisis by the European regulators.
The system was basically a big Montecarlo simulator – it would estimate the value of financial portfolios against a large number of scenarios (hundreds of thousands), do a lot of statistics and very complex aggregation rules, and produce a ton of reports. It processed a pretty big amount of data – a typical run would use some 200 CPUs and 1.5 TB of RAM for roughly twenty minutes to generate dozens of gigabytes of reports if I remem