Authors: Frank Ren | Director, Backend Engineering; Xiaohu Li | Manager, Backend Engineering; Devin Thomson | Lead, Backend Engineer; Daniel Geng | Backend Engineer
Special thanks to: Timothy Der | Senior Site Reliability Engineer, for operational and deployment support
In the earliest stages of Tinder’s explosive growth, the engineering team identified search as a key component for serving real-time recommendations. It has been an important part of Tinder’s recommendations system ever since.
Tinder’s original search architecture was quite simple: one Elasticsearch cluster with one index and the default five shards. We operated with this setup for years, adding more replicas and more powerful nodes as needed. As time went on, the shards grew larger and more replicas were added to keep latency low. We knew the design would no longer hold up to our scaling expectations when we reached a point where, despite running a large number of powerful nodes, we were still seeing high CPU utilization and correspondingly high infrastructure costs.
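For reference, that baseline is conceptually just one globally shared index. A minimal sketch of creating it through Elasticsearch’s index-creation REST API (the host, index name, and replica count here are illustrative, not our production values) might look like:

```python
import requests

# Hypothetical cluster endpoint and index name; replica count is illustrative.
ES_HOST = "http://localhost:9200"

index_settings = {
    "settings": {
        "number_of_shards": 5,    # the default we started with
        "number_of_replicas": 2,  # grown over time to keep latency low
    }
}

# Create the single, global user index.
resp = requests.put(f"{ES_HOST}/users", json=index_settings)
resp.raise_for_status()
```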
Tinder’s recommendation use cases are location-based, with a maximum distance of 100 miles. When serving a user in California, there is no need to include users in London. Additionally, index size significantly affects indexing and search capacity and performance; in large-scale tests we found that performance increases linearly as index size decreases. If we can create more shards bounded by location (or “geoshards”), each sub-index becomes smaller and performance should increase. With this knowledge in mind, the question then became: what’s the best way to do it?
One quick note about terminology: Elasticsearch itself splits each index into multiple pieces, commonly referred to as shards, which are spread across data nodes. To differentiate in this article, we use “geoshard” for the sharding we added on top of Elasticsearch, and reserve “shard” for the verb or for a generic Elasticsearch shard.
Sharding Approach
Let’s start with the simple case: put all users, globally, into one single search index. For a user who lives in Los Angeles, a search query would look up this single index, which contains the entire user base. People who live on the East Coast, or even in another country, only increase the index size, which hurts query performance while providing no value for the user in Los Angeles.
This points to an avenue for optimization: if we can divide the data so that a query touches only the index containing the minimum set of documents that matter to it, the amount of computation would be orders of magnitude smaller and the query would be much faster.
Luckily, in Tinder’s case, queries are geo-bounded and have a 100-mile limit. This naturally lends itself to a solution based on geography: storing users who are physically near each other in the same geoshard.
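To make that concrete, here is a minimal sketch of such a geo-bounded lookup using Elasticsearch’s geo_distance filter. The index name, location field, and coordinates are illustrative, and real recommendation queries include many more criteria:

```python
import requests

ES_HOST = "http://localhost:9200"

# Find users within 100 miles of a user in Los Angeles (coordinates are illustrative).
query = {
    "query": {
        "bool": {
            "filter": {
                "geo_distance": {
                    "distance": "100mi",
                    "location": {"lat": 34.05, "lon": -118.24},
                }
            }
        }
    }
}

resp = requests.post(f"{ES_HOST}/users/_search", json=query)
hits = resp.json()["hits"]["hits"]
```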
A good sharding approach should ensure the production load of the geoshards is balanced; otherwise, we will have a hot-shard issue. If we can quantify the load of each geoshard (its “load score”), then the load scores of all the geoshards should be roughly the same. Obviously, the scheme won’t be effective with too few geoshards (say, just 1) or too many (say, 1 million), so we also need to find the right number of geoshards.
Balance Issue
One simple approach would be to divide the world map into a grid of evenly spaced latitude and longitude cells.
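As an illustration only (this is the naive scheme being critiqued, with an arbitrary cell size, not what we shipped), the bucketing might look like:

```python
def naive_geoshard_id(lat: float, lon: float, cell_deg: float = 10.0) -> int:
    """Map a coordinate to a grid cell by evenly spacing latitude and longitude."""
    row = int((lat + 90.0) // cell_deg)    # 0..17 for 10-degree cells
    col = int((lon + 180.0) // cell_deg)   # 0..35 for 10-degree cells
    cells_per_row = int(360.0 / cell_deg)
    return row * cells_per_row + col

# Example: Los Angeles and London land in different cells.
print(naive_geoshard_id(34.05, -118.24))  # 438
print(naive_geoshard_id(51.51, -0.13))    # 521
```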
This clearly won’t work well. Some geoshards in the ocean will be empty, with no users at all, while other geoshards covering Europe might contain several big cities; the geoshards will be very unbalanced, resulting in hot shards in production. An even latitude/longitude grid is also very skewed near Earth’s poles: the real geographical area covered by a cell near a pole can be on the order of a thousand times smaller than that of a cell at the equator, so we need to find a better projection.
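A quick back-of-the-envelope check of that skew (the cell size is chosen arbitrarily for illustration): on a sphere, the true area of a latitude/longitude cell is proportional to the difference of the sines of its bounding latitudes, so cells of equal angular size shrink dramatically toward the poles.

```python
import math

def cell_area(lat_deg: float, d_lat_deg: float, d_lon_deg: float) -> float:
    """Relative surface area of a lat/lon grid cell on a unit sphere."""
    lat1 = math.radians(lat_deg)
    lat2 = math.radians(lat_deg + d_lat_deg)
    return math.radians(d_lon_deg) * (math.sin(lat2) - math.sin(lat1))

# With 0.1-degree cells, the polar-most cell covers roughly 1/1000 of the
# area of an equatorial cell of the same angular size.
print(cell_area(0.0, 0.1, 0.1) / cell_area(89.9, 0.1, 0.1))  # ~1150
```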
Load Score
How can we better balance the geoshards? As with any optimization problem, you can’t optimize what you can’t measure.
There are multiple ways to calculate the load:
- Unique user count
- Active user count
- User query count per hour
- Combination of the above
For simplicity, let’s say we use unique user count: it is simple to calculate and easy to aggregate (the count for a larger area is simply the sum of the counts of its sub-areas).
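As a sketch of that idea (the data shape and the geoshard_id callback are hypothetical stand-ins for whichever sharding scheme is ultimately chosen), computing a unique-user load score per geoshard and checking the balance could look like:

```python
from collections import Counter

def load_scores(user_locations, geoshard_id):
    """Unique-user count per geoshard, given (user_id, lat, lon) tuples
    and a function mapping a coordinate to a geoshard id."""
    seen = set()
    counts = Counter()
    for user_id, lat, lon in user_locations:
        if user_id in seen:          # count each user only once
            continue
        seen.add(user_id)
        counts[geoshard_id(lat, lon)] += 1
    return counts

def balance_ratio(counts):
    """Max load score divided by the mean; close to 1.0 means well-balanced geoshards."""
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean

# Usage sketch:
# scores = load_scores(all_users, naive_geoshard_id)
# print(balance_ratio(scores))  # large values indicate hot geoshards
```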