This post was co-written with Sandeep Koppula, Senior Manager, Software Engineering, Rahul Gupta, Senior Software Engineer at Mindbody & Mukesh Agrawal, Database Specialist, Sr. Solutions Architect at AWS.
Mindbody is the leading cloud-based technology platform for the fitness and wellness industry, empowering businesses to grow and thrive. Through innovative solutions such as client booking, scheduling, integrated payments, marketing, and analytics, Mindbody streamlines operations and enhances customer engagement. Thousands of businesses across the fitness and wellness space rely on Mindbody for all-in-one management; millions of consumers book experiences with those businesses through the Mindbody application. By using cutting-edge cloud technology, Mindbody provides scalability, reliability, and continuous innovation.
Amazon Aurora PostgreSQL Optimized Reads is a performance enhancement feature specifically designed for Amazon Aurora PostgreSQL Compatible Edition. It focuses on improving the efficiency of read operations by reducing the latency and increasing the throughput for queries. To learn more about this feature and use case, refer to Amazon Aurora Optimized Reads for Aurora PostgreSQL with up to 8x query latency improvement for I/O-intensive applications.
In this post, we highlight the scaling and performance challenges Mindbody was facing due to an increase in their data growth. We also present the root cause analysis and recommendations for adopting to Aurora Optimized Reads, outlining the steps taken to address these issues. Finally, we discuss the benefits Mindbody realized from implementing these changes, including enhanced query performance, significant cost savings, and improved price predictability.
“Upgrading to Aurora Optimized Reads brought meaningful advances to our operations. Despite minor adjustments needed during the transition, Amazon’s solutions and recommendations facilitated a smooth and low-friction migration. This shift has notably streamlined our database performance, cutting down query latency and allowing us to deliver quicker, more responsive service to our customers. The efficiency gains are coupled with cost savings, which allows us to reinvest in enhancing the customer experience and reinforcing our platform’s reliability.”
— Jacob Meacham, CTO, at Mindbody
Current architecture
Mindbody’s email marketing platform operates on an Aurora PostgreSQL cluster to support their business needs. The database version is 13.8, approximately 17 TB in size, and includes partitioned tables with billions of rows. To handle peak workloads, Mindbody uses the largest Aurora instances, and the workload is 80% read and 20% write.
The following diagram illustrates the current architecture.
Scaling and query performance challenges
Mindbody’s Marketing Suite application faced significant scaling and performance challenges due to architectural constraints and growing data demands. Despite support in Aurora for read scaling with up to 15 read replicas, their legacy version of application stack (rails) lacks read-write splitting, directing all workloads to the writer node while using readers solely as failover targets. Complex dynamic SQL queries, involving joins across billions of rows in partitioned tables, further complicate query optimization efforts.
The absence of an integrated data warehouse forces reliance on the online transaction processing (OLTP) database, adding strain to the Aurora writer instance. With vertical scaling maxed out at db.r6i.32xlarge, the application suffers from slower query execution, extended response times, and limited scalability.
Due to these challenges, Mindbody sought guidance on how to further scale up the cluster to improve query performance—without undergoing the extensive process of query tuning, application rewrites, or database refactoring involving sharding.
Root cause analysis
After reviewing performance data in Amazon CloudWatch and Amazon RDS Performance Insights, along with cost analysis from AWS Cost Explorer, we uncovered several key findings.
Firstly, CloudWatch shows that the average BufferCacheHitRatio
consistently remains below 80%. A healthy ratio is typically above 95%, and anything lower suggests that queries are frequently accessing disk rather than being served from the cache.
This can be verified in performance insights by looking at the top wait event IO:DataFileRead
. This event occurs when a connection waits on a backend process to read a required page from storage because the page isn’t available in shared memory. When the needed data isn’t in memory, Aurora fetches it from storage, increasing load on DB instance CPU and network utilization, causing higher latencies for queries, and incurring I/O costs. To mitigate the impact of n