
How Convex Took Down T3 Chat (Postmortem) by ko_pivot
I used to co-host a podcast called The Downtime Project. Each episode, my old friend Tom Kleinpeter and I walked through a public tech postmortem, extracted lessons, and related our own stories about outages of our past projects. Back then, in 2021, Convex was just a prototype. Once or twice, I commented on air that karma would ensure Convex got its own Downtime Project-worthy incident or two one day. Here we are! Let’s dive in.
First, I want to apologize to all the T3 Chat users who were impacted by Sunday’s outage. Despite what Theo’s discount code might lead you to believe, none of this was Theo’s fault. It was Convex’s fault.
The vast majority of Convex is open source, so throughout this post, I’ll refer to our public source code and PRs whenever I can.
This is a complex outage, so bear with me! I hope to tie it all together in the end.
Generally speaking, T3 Chat’s traffic should be manageable by Convex, even at a 10x or 100x higher level. But there are two things about T3 Chat’s use of Convex that are new for us, and led to system states we were underprepared for:
- T3 Chat relies more heavily on text search than other high-traffic Convex customers.
- Users frequently leave T3 Chat open in a background tab, so it’s there when they need it.
These two factors combined led to a new operational pattern on our platform that set the stage for Sunday’s outage.
A bit of background on Convex is also helpful here:
- Convex manages its sync protocol over a WebSocket.
- Convex is a reactive backend. Clients can subscribe to TypeScript server “query functions.” When mutation functions change some records in the database, Convex ensures the affected subscriptions are updated by re-running the queries and producing the new results. We call that “query invalidation.”
- As we’ve scaled up customers, we’ve slowly accumulated a vast list of knobs. These are runtime-customizable parameters that we can adjust on a case-by-case basis to make larger customers scale smoothly. They represent resources, cache sizes, limits, etc., within a Convex deployment.
- Convex’s text search system is far less proven at scale than the rest of our platform. There are performance bottlenecks we’re only now tuning for T3 Chat.
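To make the “query invalidation” idea above concrete, here’s a toy sketch of the pattern in TypeScript. This is not Convex’s actual implementation (all names here are illustrative): each subscription records which tables its query read, and a write re-runs every subscription whose read set overlaps the changed table.

```typescript
// Toy sketch of reactive query invalidation (illustrative only, not
// Convex's real engine). A subscription tracks the tables its query
// reads; a write invalidates and re-runs the affected subscriptions.
type Row = Record<string, unknown>;

interface Subscription {
  reads: Set<string>;                     // tables this query depends on
  query: (db: ReactiveDB) => unknown;     // the query function itself
  onUpdate: (result: unknown) => void;    // pushed to the client on change
}

class ReactiveDB {
  private tables = new Map<string, Row[]>();
  private subs: Subscription[] = [];
  private tracking: Set<string> | null = null;

  read(table: string): Row[] {
    this.tracking?.add(table); // record the dependency while a query runs
    return this.tables.get(table) ?? [];
  }

  subscribe(query: (db: ReactiveDB) => unknown, onUpdate: (r: unknown) => void) {
    const reads = new Set<string>();
    this.tracking = reads;
    onUpdate(query(this)); // initial run, which also captures the read set
    this.tracking = null;
    this.subs.push({ reads, query, onUpdate });
  }

  insert(table: string, row: Row) {
    const rows = this.tables.get(table) ?? [];
    rows.push(row);
    this.tables.set(table, rows);
    // Invalidation: re-run every subscription that read this table.
    for (const s of this.subs) {
      if (s.reads.has(table)) s.onUpdate(s.query(this));
    }
  }
}
```

The key cost to notice for this outage: every write fans out to the subscriptions it invalidates, so the amount of refresh work is driven by how many live subscriptions overlap each write, not just by write volume.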
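The “knobs” mentioned above can be pictured as a simple registry of defaults with per-deployment overrides. This is a hypothetical sketch (the knob name and default below are invented for illustration, not Convex’s real values):

```typescript
// Hypothetical sketch of a runtime-tunable "knob": a compiled-in
// default that operators can override per deployment without a deploy.
type Knob<T> = { name: string; default: T };

class KnobRegistry {
  private overrides = new Map<string, unknown>();

  // Operator-set override for one deployment.
  set(name: string, value: unknown) {
    this.overrides.set(name, value);
  }

  // Read path: override if present, otherwise the compiled-in default.
  get<T>(knob: Knob<T>): T {
    const v = this.overrides.get(knob.name);
    return v === undefined ? knob.default : (v as T);
  }
}

// Illustrative knob; the name and default are made up.
const MAX_PENDING_SUBSCRIPTIONS: Knob<number> = {
  name: "max_pending_subscriptions",
  default: 10_000,
};
```

The operational upside is that a limit like this can be raised for one large customer in seconds; the downside, as this incident shows, is that a default tuned for typical workloads can quietly become the binding constraint for an atypical one.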
T3 Chat was slow for various periods between 6:30a PT and approximately 9:45a PT.
T3 Chat was essentially unusable between 9:45a and 12:27p PT.
What we did and what we knew, in real time. All times are PT on Sunday, June 1, 2025.
6-7a: The Convex on-call engineer noticed elevated error rates for some T3 Chat users. The errors subsided, and load returned to normal, so the investigation was tabled until more engineers were online.
~9:30a: The errors returned. With several engineers online, the team realized a new limit within the codebase was being hit as more T3 Chat users were waking up and using the app: the number of pending subscriptions that needed to be refreshed. T3 Chat was periodically having gi