This is one in a series of papers I’m reading from ASPLOS. These paper reviews can be delivered weekly to your inbox, or you can subscribe to the Atom feed. As always, feel free to reach out on Twitter with feedback or suggestions!
Towards an Adaptable Systems Architecture for Memory Tiering at Warehouse-Scale
Applications running in datacenter environments require resources to operate, like dynamic random access memory (DRAM). DRAM is both expensive (in an article cited by the paper, a distinguished engineer from Microsoft's Azure cloud said, "About 50 percent of my server cost is memory.") and in high demand (partially due to the growth of AI applications). As alternatives to DRAM emerge, new approaches to trading off performance for cost become available to datacenter applications – specifically, the paper mentions Intel Optane, which offers significant cost savings (one point-in-time description of the cost savings is here).
To use this new resource type while limiting the impact on application performance, the authors proposed, built, and deployed at scale a new system called the Transparent Memory Tiering System (TMTS). The approach interacts with Google's Borg scheduler (at scale, cluster schedulers abstract allocation of raw resources to an application – see Borg from Google, Twine from Facebook, and the open source Mesos project, used at some point by Twitter and Apple) to adaptively move an application's in-memory data from high performance memory to lower cost mediums, an approach it calls memory tiering – based on usage, the system "promotes" in-use memory pages into high performance memory, and "demotes" infrequently-used memory to lower-performance mediums.
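The promote/demote loop can be pictured with a toy sketch. This is not TMTS's actual policy – the thresholds, field names, and scan logic below are invented for illustration; the real system uses kernel-level page access tracking and more sophisticated heuristics.

```python
# Toy sketch of a tiering promotion/demotion pass (not TMTS's actual policy).
# Pages with enough recent accesses move up to tier1 (e.g. DRAM); pages idle
# for too long move down to tier2 (e.g. a lower-cost medium like Optane).
from dataclasses import dataclass

PROMOTE_THRESHOLD = 2        # accesses in the last scan window (made up)
DEMOTE_AFTER_SECONDS = 120   # idle time before demotion (made up)

@dataclass
class Page:
    tier: int                # 1 = high performance, 2 = lower cost
    recent_accesses: int     # accesses seen in the last scan window
    idle_seconds: float      # time since the page was last accessed

def rebalance(pages: list[Page]) -> None:
    """One scan pass: promote hot tier2 pages, demote cold tier1 pages."""
    for page in pages:
        if page.tier == 2 and page.recent_accesses >= PROMOTE_THRESHOLD:
            page.tier = 1    # "promote" in-use memory into fast memory
        elif page.tier == 1 and page.idle_seconds >= DEMOTE_AFTER_SECONDS:
            page.tier = 2    # "demote" infrequently-used memory
```

The key idea the sketch captures is that placement decisions are driven by observed usage, not by the application itself – which is what makes the tiering "transparent".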
When deployed at scale, TMTS replaced 25% of DRAM with lower cost solutions, while incurring little performance impact to the vast majority of applications.
What are the paper’s contributions?
The paper makes three main categories of contributions:
- Design and implementation of a memory tiering system.
- A testing methodology for evaluating changes to the system at scale.
- Lessons and evaluation from running the implementation in production.
How does the system work?
System metrics
The paper discusses the key tradeoff the system needs to make between potential cost savings and performance degradation. For example, applications on the critical path of user requests (which the paper calls high importance latency sensitive (HILS)) are more sensitive to latency and performance impact than batch workloads.
Furthermore, for a tiered memory system to deliver on its promise of cost savings, it needs to strike the right balance in using lower performance hardware – if the cluster scheduler doesn't run jobs on the lower performance hardware, the resources intended to save cost will sit idle and unutilized (so the potential cost savings never materialize, while you've still paid for the extra resources). On the other hand, if the scheduler assigns latency-sensitive applications to lower performance hardware, performance will suffer.
Using these two considerations, the paper defines two metrics to measure the success of memory tiering:
- Secondary Tier Residency Ratio (STRR) represents the “fraction of allocated memory residing in tier2 [lower performance memory]”.
- Secondary Tier Access Ratio (STAR) “is the fraction of all memory accesses of an application directed towards pages resident in tier2”. This is a proxy for application performance impact because an application accessing lower tier memory will likely incur higher latency.
In summary, the goal of the system is to maximize usage of cheaper/lower performance memory (represented via STRR) while minimizing negative impact to application performance (via STAR).
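Both metrics are simple ratios, which a short sketch makes concrete (the function and variable names here are mine, not the paper's):

```python
def strr(tier2_bytes: int, total_bytes: int) -> float:
    """Secondary Tier Residency Ratio: fraction of allocated memory
    residing in tier2 (lower performance memory). Higher is better
    for cost savings."""
    return tier2_bytes / total_bytes

def star(tier2_accesses: int, total_accesses: int) -> float:
    """Secondary Tier Access Ratio: fraction of all memory accesses
    directed at pages resident in tier2. Lower is better for
    application performance."""
    return tier2_accesses / total_accesses

# Example: an app with 25% of its memory in tier2, but only 5% of
# its accesses hitting tier2, is tiering effectively.
residency = strr(tier2_bytes=25, total_bytes=100)   # 0.25
access = star(tier2_accesses=5, total_accesses=100) # 0.05
```

The tension between the two is the core of the system: pushing more memory into tier2 raises STRR (good), but risks raising STAR (bad) if the wrong pages get demoted.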

System Architecture
The memory tiering system is divided into four levels of abstraction: hardware, kernel, userspace, and the cluster scheduler.

At the bottom of the stack is the underlying hardware, made up of several types of memory devices with different performance and cost profiles.
Immediately above the hardware is the kernel, which abstracts the hardware into different tiers (tier1 for higher performance, tier2 for lower performance) and operates on hardware abstractions like memory pages (for more context on memory abstractions like pages, I highly recommend Operating Systems: Three Easy Pieces).