Braintree Payments uses PostgreSQL as its primary datastore. We rely heavily on the data safety and consistency guarantees a traditional relational database offers us, but these guarantees come with certain operational difficulties. To make things even more interesting, we allow zero scheduled functional downtime for our main payments processing services.
Several years ago we published a blog post detailing some of the things we had learned about how to safely run DDL operations without interrupting our production API traffic.
Since that time PostgreSQL has gone through quite a few major upgrade cycles — several of which have added improved support for concurrent DDL. We’ve also further refined our processes. Given how much has changed, we figured it was time for a blog post redux.
In this post we’ll address the following topics:
- Transactional DDL
- Locking
- Table Operations
- Column Operations
- Index Operations
- Constraints
- Enum Types
- Bonus: Library for Ruby on Rails
For all code and database changes, we require that:
- Live code and schemas be forward-compatible with updated code and schemas: this allows us to roll out deploys gradually across a fleet of application servers and database clusters.
- New code and schemas be backward-compatible with live code and schemas: this allows us to roll back any change to the previous version in the event of unexpected errors.
For all DDL operations we require that:
- Any exclusive locks acquired on tables or indexes be held for at most ~2 seconds.
- Rollback strategies do not involve reverting the database schema to its previous version.
PostgreSQL supports transactional DDL. In most cases, you can execute multiple DDL statements inside an explicit database transaction and take an “all or nothing” approach to a set of changes. However, running multiple DDL statements inside a transaction has one serious downside: if you alter multiple objects, you’ll need to acquire exclusive locks on all of those objects in a single transactions. Because locks on multiple tables creates the possibility of deadlock and increases exposure to long waits, we do not combine multiple DDL statements into a single transaction. PostgreSQL will still execute each separate DDL statement transactionally; each statement will be either cleanly applied or fail and the transaction rolled back.
Note: Concurrent index creation is a special case. Postgres disallows executing CREATE INDEX CONCURRENTLY
inside an explicit transaction; instead Postgres itself manages the transactions. If for some reason the index build fails before completion, you may need to drop the index before retrying, though the index will still never be used for regular queries if it did not finish building successfully.
PostgreSQL has many different levels of locking. We’re concerned primarily with the following table-level locks since DDL generally operates at these levels:
ACCESS EXCLUSIVE
: blocks all usage of the locked table.SHARE ROW EXCLUSIVE
: blocks concurrent DDL against and row modification (allowing reads) in the locked table.SHARE UPDATE EXCLUSIVE
: blocks concurrent DDL against the locked table.
Note: “Concurrent DDL” for these purposes includes VACUUM
and ANALYZE
operations.
All DDL operations generally necessitate acquiring one of these locks on the object being manipulated. For example, when you run:
https://gist.github.com/cf70beba4a16e05840396ad129a04d85
PostgreSQL attempts to acquire an ACCESS EXCLUSIVE
lock on the table foos
. Atempting to acquire this lock causes all subsequent queries on this table to queue until the lock is released. In practice your DDL operations can cause other queries to back up for as long as your longest running query takes to execute. Because arbitrarily long queueing of incoming queries is indistinguishable from an outage, we try to avoid any long-running queries in databases supporting our payments processing applications.
But sometimes a query takes longer than you expect. Or maybe you have a few special case queries that you already know will take a long time. PostgreSQL offers some additional runtime configuration options that allow us to guarantee query queueing backpressure doesn’t result in downtime.
Instead of relying on Postgres to lock an object when executing a DDL statement, we acquire the lock explicitly ourselves. This allows us to carefully control the time the queries may be queued. Additionally when we fail to acquire a lock within several seconds, we pause before trying again so that any queued queries can be executed without significantly increasing load. Finally, before we attempt lock acquisition, we check (query here) for any currently long running queries to avoid unnecessarily queueing queries for several seconds when it is unlikely that lock acquisition is going to succeed.
Starting with Postgres 9.3, you adjust the lock_timeout
parameter to control how long Postgres will allow for lock acquisition before returning without acquiring the lock. If you happen to be using 9.2 or earlier (and those are unsupported; you should upgrade!), then you can simulate this behavior by using the statement_timeout
parameter around an explicit LOCK
statement.
In many cases an ACCESS EXCLUSIVE
lock need only be held for a very short period of time, i.e., the amount of time it takes Postgres to update its “catalog” (think metadata) tables. Below we’ll discuss the cases where a lower lock level is sufficient or alternative approaches for avoiding long-held locks that block SELECT/INSERT/UPDATE/DELETE
.
Note: Sometimes holding even an ACCESS EXCLUSIVE
lock for something more than a catalog update (e.g., a full table scan or even rewrite) can be functionally acceptable when the table size is relatively small. We recommend testing your specific use case against realistic data sizes and hardware to see if a particular operation will be “fast enough”. On good hardware with a table easily loaded into memory, a full table scan or rewrite for thousands (possibly even 100s of thousands) of rows may be “fast enough”.
Create table
In general, adding a table is one of the few operations we don’t have to think too hard about since, by definition, the object we’re “modifying” can’t possibly be in use yet. :D
While most of the attributes involved in creating a table do not involve other database objects, including a foreign key in your initial table definition will cause Postgres to acquire a SHARE ROW EXCLUSIVE
lock against the referenced table blocking any concurrent DDL or row modifications. While this lock should be short-lived, it nonetheless requires the same caution as any other operation acquiring such a lock. We prefer to split these into two separate operations: create the table and then add the foreign key.
Drop table
Dropping a table requires an exclusive lock on that table. As long as the table isn’t in current use you can safely drop the table. Before allowing a DROP TABLE ...
to make its way into our production environments we require documentation showing when all references to the table were removed from the codebase. To double check that this is the case you can query PostgreSQL’s table statistics view pg_stat_user_tables
confirming that the returned statistics don’t change over the course of a reasonable length of time.
Rename table
While it’s unsurprising that a table rename requires acquiring an ACCESS EXCLUSIVE
lock on the table, that’s far from our biggest concern. Unless the table is not being read from or written to, it’s very unlikely that your application code could safely handle a table being renamed underneath it.
We avoid table renames almost entirely. But if a rename is an absolute must, then a safe approach might look something like the following:
- Create a new table with the same schema as the old one.
- Backfill the new table with a copy of the data in the old table.
- Use INSERT and UPDATE triggers on the old table to maintain parity in the new table.
- Begin using the new table.
Other approaches involving views and/or RULEs may also be viable depending on the performance characteristics required.
Note: For column constraints (e.g., NOT NULL
) or other constraints (e.g., EXCLUDES
), see Constraints.
Add column
Adding a column to an existing table generally requires holding a short ACCESS EXCLUSIVE
lock on the table while catalog tables are updated. But there are several potential gotchas:
Default values: Introducing a default value at the same time of adding the column will cause the table to be locked while the default value in propogated for all rows in the table. Instead, you should:
- Add the new column (without the default value).
- Set the default value on the column.
- Backfill all existing rows separately.
Note: In the recently release PostgreSQL 11, this is no longer the case for non-volatile default values. Instead adding a new column with a default value only requires updating catalog tables, and any reads of rows without a value for the new column will magically have it “filled in” on the fly.
Not-null constraints: Adding a column with a not-constraint is only possible if there are no existing rows or a DEFAULT
is also provided. If there are no existing rows, then the change is effectively equivalent to a catalog only change. If there are existing rows and you are also specifying a default value, then the same caveats apply as above with respect to default values.
Note: Adding a column will cause all SELECT * FROM ...
style queries referencing the table to begin returning the new column. It is important to ensure that all currently running code safely handles new columns. To avoid this gotcha in our applications we require queries to avoid *
expansion in favor of explicit column references.
Change column type
In the general case changing a column’s type requires holding an exclusive lock on a table while the