Instagram scaled from 0 to 14 million users in just over a year, from October 2010 to December 2011. They did this with only 3 engineers.
They pulled this off by following 3 key principles and having a reliable tech stack:
- Keep things very simple.
- Don’t re-invent the wheel.
- Use proven, solid technologies when possible.
Early Instagram’s infrastructure ran on AWS, using EC2 with Ubuntu Linux. For reference, EC2 is Amazon’s service that allows developers to rent virtual computers.
To make things easy, and since I like thinking about the user from an engineer’s perspective, let’s go through the life of a user session. (Each step is marked with “Session:”.)
Session: A user opens the Instagram app.
Instagram initially launched as an iOS app in 2010. Since Swift wasn’t released until 2014, we can assume that Instagram was written in Objective-C, along with frameworks like UIKit.
Session: After opening the app, a request to grab the main feed photos is sent to the backend, where it hits Instagram’s load balancer.
Instagram used Amazon’s Elastic Load Balancer. Behind it sat 3 NGINX instances that were swapped in and out of rotation depending on whether they were healthy.
Each request hit the load balancer first before being routed to the actual application server.
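Conceptually, keeping only healthy instances in rotation looks something like this. This is a generic sketch, not Instagram’s configuration; the hosts and the /health endpoint are hypothetical:

```python
# Generic sketch: route traffic only to backends that pass a health check.
import random
import urllib.request

BACKENDS = ["http://10.0.0.1", "http://10.0.0.2", "http://10.0.0.3"]  # hypothetical NGINX instances

def healthy(url: str) -> bool:
    """Return True if the backend answers its health-check endpoint."""
    try:
        return urllib.request.urlopen(url + "/health", timeout=1).status == 200
    except OSError:
        return False

def pick_backend():
    """Pick any backend that is currently healthy; unhealthy ones drop out of rotation."""
    live = [b for b in BACKENDS if healthy(b)]
    return random.choice(live) if live else None
```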
Session: The load balancer sends the request to the application server, which holds the logic to process the request correctly.
Instagram’s application server ran Django, written in Python, with Gunicorn as their WSGI server.
As a refresher, WSGI (Web Server Gateway Interface) is the standard interface through which a server like Gunicorn forwards requests from the web server to a Python web application.
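For intuition, here’s a minimal WSGI application of the kind Gunicorn can serve. This toy example isn’t Instagram’s code; Django simply exposes an equivalent callable for Gunicorn to run.

```python
# wsgi_app.py: a minimal WSGI application. Gunicorn calls this function for
# every request it forwards from the web server / load balancer.
def application(environ, start_response):
    # environ holds the request data (path, headers, query string, ...)
    body = b"Hello from the app server"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]

# Run it with something like:
#   gunicorn --workers 4 wsgi_app:application
# (the worker count is illustrative, not Instagram's configuration)
```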
Instagram used Fabric to run commands in parallel on many instances at once, which allowed them to deploy code in just seconds.
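A rough sketch of what a parallel Fabric deploy could look like with the Fabric 1.x API of that era; the hosts and deploy commands here are made up:

```python
# fabfile.py: run a deploy step on every app server at once.
from fabric.api import env, parallel, run

# Hypothetical host list; in practice this would be the pool of app servers.
env.hosts = ["app1.example.com", "app2.example.com", "app3.example.com"]

@parallel
def deploy():
    # Pull the latest code and restart the WSGI workers (commands are illustrative).
    run("cd /srv/app && git pull")
    run("sudo service gunicorn restart")
```

Running `fab deploy` then executes the task on every host concurrently instead of one machine at a time.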
The application servers themselves lived on over 25 Amazon High-CPU Extra-Large machines. Since the server is stateless, they could simply add more machines when they needed to handle more requests.
Session: The application server sees that the request needs data for the main feed. For this, let’s say it needs:
- the latest relevant photo IDs
- the actual photos that match those photo IDs
- user data for those photos.
Session: The application server grabs the latest relevant photo IDs from Postgres.
The application server would pull data from PostgreSQL, which stored most of Instagram’s data, such as users and photo metadata.
The connections between Postgres and Django were pooled using Pgbouncer.
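In Django terms, pooling through Pgbouncer mostly means pointing the database settings at Pgbouncer’s port instead of directly at Postgres. A hypothetical settings snippet:

```python
# settings.py (illustrative): Django connects to Pgbouncer, which holds a
# pool of connections to the real Postgres server behind it.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "instagram",   # hypothetical database name
        "HOST": "127.0.0.1",   # Pgbouncer running locally
        "PORT": "6432",        # Pgbouncer's default port (Postgres itself listens on 5432)
    }
}
```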
Instagram sharded their data because of the volume they were receiving (over 25 photos and 90 likes a second). In application code, they mapped several thousand ‘logical’ shards onto just a few physical shards, which made it easy to move logical shards to new machines as the data grew.
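The logical-to-physical mapping idea can be sketched like this. Shard counts and hostnames are illustrative, not Instagram’s actual values:

```python
# Sketch: thousands of logical shards, mapped onto a handful of physical servers.
NUM_LOGICAL_SHARDS = 4096   # "several thousand" logical shards (illustrative number)

# Hypothetical physical servers; each hosts a contiguous range of logical shards.
PHYSICAL_SHARDS = ["db1.example.com", "db2.example.com",
                   "db3.example.com", "db4.example.com"]

def logical_shard_for_user(user_id: int) -> int:
    """Every user's data lives on exactly one logical shard."""
    return user_id % NUM_LOGICAL_SHARDS

def physical_host_for(logical_shard: int) -> str:
    """Logical shards are spread over the physical machines; remapping this
    function is all it takes to move shards onto new hardware."""
    shards_per_host = NUM_LOGICAL_SHARDS // len(PHYSICAL_SHARDS)
    return PHYSICAL_SHARDS[logical_shard // shards_per_host]
```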
An interesting challenge that Instagram faced and solved was generating IDs that could be sorted by time. Their resulting sortable-by-time IDs looked like this (see the sketch after the list):
- 41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
- 13 bits that represent the logical shard ID
- 10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond
(You can read more here.)
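Here’s a Python sketch of that bit layout. The custom epoch value below is made up, and Instagram actually generated these IDs inside Postgres itself, but the layout is the same:

```python
# Sketch of the 64-bit, time-sortable ID: 41 bits of time, 13 bits of
# logical shard ID, 10 bits of per-shard sequence.
import time

CUSTOM_EPOCH_MS = 1293840000000  # hypothetical custom epoch (Jan 1, 2011 UTC)

def make_photo_id(logical_shard_id: int, sequence: int) -> int:
    millis_since_epoch = int(time.time() * 1000) - CUSTOM_EPOCH_MS
    photo_id = millis_since_epoch << 23            # leave room for 13 + 10 lower bits
    photo_id |= (logical_shard_id % 8192) << 10    # 13 bits: 2**13 = 8192 logical shards
    photo_id |= sequence % 1024                    # 10 bits: 1024 IDs per shard per millisecond
    return photo_id
```

Because the timestamp occupies the highest bits, sorting these IDs numerically also sorts the photos by creation time.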
Thanks to the sortable-by-time IDs in Postgres, the application server has successfully received the latest relevant photo IDs.
Session: The application server then gets the actual photos that match those photo IDs, along with CDN links so they load quickly for the user.
Several terabytes of photos were stored in Amazon S3. These photos were served to users quickly using Amazon CloudFront.
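For illustration, an upload path using the boto library of that era might look roughly like this. The bucket name, key scheme, and CloudFront domain are all hypothetical:

```python
# Sketch: store a photo in S3 and hand back a CDN URL for clients to load.
from boto.s3.connection import S3Connection
from boto.s3.key import Key

def upload_photo(photo_id, local_path):
    conn = S3Connection()                      # credentials come from the environment
    bucket = conn.get_bucket("photos-bucket")  # hypothetical bucket name
    key = Key(bucket, "photos/%d.jpg" % photo_id)
    key.set_contents_from_filename(local_path)
    # Serve the photo through CloudFront rather than the raw S3 URL,
    # so it is delivered from an edge location close to the user.
    return "https://d1234example.cloudfront.net/photos/%d.jpg" % photo_id
```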
Session: To get the user data from Postgres, the application server (Django) matches photo IDs to user IDs using Redis.
Instagram used Redis to store a mapping of about 300 million photos to the user ID that created them, in order to know which shard to query when building the main feed, activity feed, etc. All of this data was held in memory to decrease latency, and it was sharded across multiple machines.
With some clever hashing, Instagram was able to store 300 million key mappings in less than 5 GB.
This photoID to user ID key-value mapping was needed in order to know which Postgres shard to query.
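The “clever hashing” boils down to bucketing the mapping into Redis hashes rather than storing one top-level key per photo, since Redis stores small hashes very compactly. A sketch with redis-py; the bucket size and key names are illustrative:

```python
# Sketch: store photoID -> userID in bucketed Redis hashes to save memory.
import redis

r = redis.StrictRedis(host="localhost", port=6379)  # hypothetical Redis instance

BUCKET_SIZE = 1000  # ~1000 entries per hash keeps each hash small and compact

def store_mapping(photo_id: int, user_id: int) -> None:
    # Group photo IDs into buckets so Redis can encode each bucket efficiently.
    bucket = photo_id // BUCKET_SIZE
    r.hset("mediabucket:%d" % bucket, photo_id, user_id)

def lookup_user(photo_id: int) -> int:
    bucket = photo_id // BUCKET_SIZE
    return int(r.hget("mediabucket:%d" % bucket, photo_id))
```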
Session: Thanks to efficient caching using Memcached,