It was a wild year for the database industry, with newcomers overtaking the old guard, vendors fighting over benchmark numbers, and eye-popping funding rounds. We also had to say goodbye to some of our database friends through acquisitions, bankruptcies, or retractions.
As the end of the year draws near, it’s worth reflecting and taking stock as we move into 2022. Here are some of the highlights and a few of my thoughts on what they might mean for the field of databases.
Dominance of PostgreSQL
The conventional wisdom among developers has shifted: PostgreSQL has become the first choice in new applications. It is reliable. It has many features and keeps adding more. In 2010, the PostgreSQL development team switched to a more aggressive release schedule to put out a new major version once per year (H/T Tomas Vondra). And of course PostgreSQL is open-source.
PostgreSQL compatibility is a distinguishing feature for a lot of systems now. Such compatibility is achieved by supporting PostgreSQL’s SQL dialect (DuckDB), wire protocol (QuestDB, HyPer), or the entire front-end (Amazon Aurora, YugaByte, Yellowbrick). The big players have jumped on board. Google announced in October that they added PostgreSQL compatibility in Cloud Spanner. Also in October, Amazon announced the Babelfish feature for converting SQL Server queries into Aurora PostgreSQL.
One measure of a database's popularity is the DB-Engines ranking. This ranking is not perfect and the score is somewhat subjective, but it’s a reasonable approximation for the top 10 systems. As of December 2021, the ranking shows that while PostgreSQL remains the fourth most popular database (after Oracle, MySQL, and MSSQL), it narrowed the gap with MSSQL over the past year.
Another trend to consider is how often PostgreSQL is mentioned in online communities. This gives another signal for what people are talking about in databases. I downloaded all of the 2021 comments made on the Database Subreddit and counted the frequency of database names (in PostgreSQL of course). I cross-referenced the list of every database that I know about from my Database of Databases, cleaned up abbreviations (e.g., Postgres → PostgreSQL, Mongo → MongoDB, ES → Elasticsearch), and then calculated the top 10 most-mentioned DBMS:
     dbms      | cnt
---------------+-----
 PostgreSQL    | 656
 MySQL         | 317
 MongoDB       | 266
 Oracle        | 222
 SQLite        | 213
 Redis         |  88
 Elasticsearch |  70
 Snowflake     |  52
 DGraph        |  46
 Neo4j         |  42
Of course this ranking is not scientific, since I am not doing sentiment analysis on the comments. But it clearly shows that people are mentioning Postgres more than other systems in the past year. There are often posts from developers asking what DBMS to use for their new application, and the response from the community is almost always Postgres.
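For readers who want to try a similar count on their own data, here is a rough sketch of the mention-counting step. The original analysis was done in PostgreSQL; this version uses Python purely for illustration. The alias map is a hypothetical stand-in (only the Postgres, Mongo, and ES abbreviations come from the text above), the function names are made up, and it assumes every occurrence of a name counts as one mention, which may not match how the numbers above were computed.

```python
import re
from collections import Counter

# Hypothetical alias map: lowercase token -> canonical DBMS name.
# Only the Postgres/Mongo/ES abbreviations are taken from the post;
# a real run would be seeded from a much longer list of systems.
ALIASES = {
    "postgres": "PostgreSQL",
    "postgresql": "PostgreSQL",
    "mongo": "MongoDB",
    "mongodb": "MongoDB",
    "es": "Elasticsearch",
    "elasticsearch": "Elasticsearch",
    "mysql": "MySQL",
    "sqlite": "SQLite",
}

# Split comments into lowercase alphanumeric tokens.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def count_mentions(comments, aliases=ALIASES):
    """Count every occurrence of a known DBMS name or alias."""
    counts = Counter()
    for comment in comments:
        for token in TOKEN_RE.findall(comment.lower()):
            name = aliases.get(token)
            if name is not None:
                counts[name] += 1
    return counts


def top_n(comments, n=10):
    """Return the n most-mentioned systems as (name, count) pairs."""
    return count_mentions(comments).most_common(n)


if __name__ == "__main__":
    # Made-up sample comments, not real Reddit data.
    sample = [
        "Just use Postgres.",
        "Mongo or PostgreSQL? postgres!",
        "ES is fine for search",
    ]
    for name, cnt in top_n(sample):
        print(f"{name:<14}| {cnt}")
```

Short aliases such as "es" will also match unrelated words, so a real run would need some manual cleanup of false positives.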
Andy’s Take:
Foremost, it is a good thing that a relational database system has become the first choice in greenfield applications. This shows the staying power of Ted Codd’s relational model from the 1970s. Second, PostgreSQL is a great database system. Yes, it has known issues and dark corners, as does every DBMS. But with so much attention and energy focused on it, PostgreSQL is only going to get better over the years.
Benchmark Violence
There’s no love lost between different database vendors this year over benchmark results. Vendors trying to show that their system is faster than their competitors’ goes back to the late 1980s. That’s why TPC was set up to provide a non-partisan forum for officiating over comparisons. But as the influence and prevalence of TPC has waned over the last decade, we now find ourselves in a new round of database benchmark wars.
There were three major street battles that heated up this year over benchmark results.
- Databricks vs. Snowflake
Databricks announced that their new Photon SQL engine set a new world record in 100TB TPC-DS. Snowflake fired back, saying its database is 2x faster and that Databricks ran Snowflake incorrectly. Databricks countered, claiming their SQL engine provides superior execution and price performance over Snowflake.
- Rockset vs. Apache Druid vs. ClickHouse
ClickHouse came out swinging, saying it nailed cost efficiency when compared to Druid and Rockset. But not so fast: Imply responded with tests on a newer version of Druid and claimed victory. Rockset joined in, saying its performance was better for real-time analytics than the other two.
- ClickHouse vs. TimescaleDB
Smelling blood in the water, tiger-style Timescale joined the fray. They shot out their own benchmark results and took the opportunity to point out weaknesses in ClickHouse’s