Key Takeaways
- Operational and analytical use cases alike are unable to reliably access relevant, complete, and trustworthy data. A new approach to data processing is needed.
- While the multi-hop architecture has been around for decades and can bridge operational and analytical use cases, it’s inefficient, slow, expensive, and difficult to reuse.
- The shift left approach takes the same data processing happening downstream and shifts it left (upstream) so more teams can access relevant, complete, and trustworthy data.
- Data products are a key part of a shift left, forming the basis of data communications across the business.
- Data contracts keep data products healthy and provide a barrier between the internal and external data models, giving data product users a stable yet evolvable API and well-defined boundaries into business domains.
Operational and analytical use cases all face the same problem: they are unable to reliably access relevant, complete, and trustworthy data from across their organization. Instead, each use case typically cobbles together its own means of accessing data. ETL pipelines may partially solve data access for analytical use cases, while a REST API may serve some ad hoc data access requests for operational use cases.
However, each independent solution requires its own implementation and maintenance, resulting in duplicate work, excessive costs, and similar yet slightly different data sets.
There is a better way to make data available to the people and systems that need it, regardless of whether they’re using it for operational, analytical, or in-between purposes. It involves rethinking those archaic yet still commonly used ETL patterns, the expensive and slow multi-hop data processing architectures, and the “everyone for themselves” mentality prevalent in data access responsibilities. It’s not only a shift in thinking but also a shift in where we do our data processing, who can use it, and how we implement it. In short, it’s a shift left: take the very same work you’re already doing (or will be doing) downstream, and shift it left (upstream) so that everyone can benefit from it.
But what are we shifting left from?
Rethinking Data Lakes and Warehouses
A data lake is typically built as a multi-hop architecture, in which data is processed and copied multiple times before eventually arriving at a level of quality and organization that can power a specific business use case. Data flows from left to right, beginning with some form of ETL from the source system into the data lake or data warehouse.
Multi-hop architectures have been around for decades as part of bridging the operational-analytical divide. However, they are inherently inefficient, slow, expensive, and difficult to reuse.
The medallion architecture is the most popular form of multi-hop architecture today. It divides data into three layers, named after Olympic medals: bronze, silver, and gold. Each layer represents progressively higher quality, reliability, and guarantees, with bronze the weakest and gold the strongest (a minimal pipeline sketch follows the list and figure below).
- Bronze layer: The bronze layer is the landing zone for raw imported data, often a mirror of the source data model. Data practitioners then add structure, schemas, enrichment, and filtering to the raw data. The bronze layer is the primary data source for the higher-quality silver layer data sets.
- Silver layer: The silver layer provides filtered, cleaned, structured, standardized, and (re)modeled data suitable for analytics, reporting, and further advanced computation. These data sets are the building blocks for calculating analytics, building reports, and populating dashboards, and are commonly organized around important business entities. For example, the silver layer may contain data sets representing all business customers, sales, receivables, inventory, and customer interactions, each well-formed, schematized, deduplicated, and verified as “trustworthy canonical data”.
- Gold layer: This layer delivers “business-level” and “application-aligned” data sets, purpose-built to provide data for specific applications, use cases, projects, reports, or data exports. Gold layer data is predominantly denormalized and optimized for reads, though you may still need to join against other data sets at query time. While the silver layer provides the “building blocks”, the gold layer provides the “building”, assembled from those blocks and cemented together with additional logic.
The medallion architecture, a popular version of the multi-hop architecture
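To make the three hops concrete, here is a minimal sketch of a medallion pipeline in PySpark. The lake paths, the orders source, and the specific cleanup steps are illustrative assumptions, not a prescribed implementation:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land the raw source extract as-is, mirroring the source model.
# (Hypothetical paths; any object store or local filesystem works the same way.)
bronze = spark.read.json("s3://lake/raw/orders/")
bronze.write.mode("overwrite").parquet("s3://lake/bronze/orders")

# Silver: filter, deduplicate, and standardize into a canonical entity.
silver = (
    spark.read.parquet("s3://lake/bronze/orders")
    .filter(F.col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
)
silver.write.mode("overwrite").parquet("s3://lake/silver/orders")

# Gold: denormalize and aggregate for one specific report.
gold = silver.groupBy("customer_id").agg(
    F.sum("total").alias("lifetime_spend"),
    F.count("order_id").alias("order_count"),
)
gold.write.mode("overwrite").parquet("s3://lake/gold/customer_spend")
```

Note that every hop re-reads and re-writes a full copy of the data, which is exactly the cost and latency problem examined below.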
There are serious problems with the medallion architecture and, indeed, with all multi-hop architectures. Let’s examine why.
Flaw #1: The consumer is responsible for data access
The multi-hop medallion model is predicated on a pull system. A downstream data practitioner must first create an ETL job to pull data into the medallion architecture; then they must remodel it, clean it, and put it into a usable form before any real work can begin. This is a reactive and untenable position: the consumer bears all the responsibility for keeping the pipeline running and the data available, without ownership of, or even influence over, the source model.
Furthermore, the ETL job is coupled directly to the source data model, a very tight and brittle coupling. Any change to the upstream system (such as a database schema migration) can break the pipeline.
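To see how brittle this coupling is, consider a hypothetical consumer-owned ETL job, using SQLite as a stand-in for the production source database; the table and column names are invented for the example:

```python
import sqlite3  # stand-in for the producer's production database

# The consumer's extract query is coupled to the producer's *internal*
# physical column names, which the producer never promised to keep stable.
EXTRACT_QUERY = "SELECT id, cust_name, amount_cents FROM orders"

def extract(conn: sqlite3.Connection) -> list[tuple]:
    return conn.execute(EXTRACT_QUERY).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, cust_name TEXT, amount_cents INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 'Alice', 1999)")
print(extract(conn))  # works today: [(1, 'Alice', 1999)]

# The producer then refactors their internal model, unaware of the consumer...
conn.execute("ALTER TABLE orders RENAME COLUMN cust_name TO customer_name")
extract(conn)  # raises sqlite3.OperationalError: no such column: cust_name
```

The producer made a perfectly reasonable internal change, yet the consumer’s pipeline broke. Data contracts address exactly this by decoupling the consumer-facing model from the internal one.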
Flaw #2: The medallion architecture is expensive
Populating a bronze layer requires copying large volumes of data and substantial processing power. Then we immediately process and copy that data again to bring it up to silver layer quality. Each step incurs costs (loading data, network transfers, writing copies back to disk, and compute), and these costs quickly add up to a large bill. The architecture is inherently expensive to build and maintain, and it wastes significant resources.
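A back-of-envelope illustration makes the multiplication visible. All figures here are hypothetical, chosen only to show the shape of the math, not to represent any particular vendor’s pricing:

```python
# Hypothetical figures for a three-hop medallion pipeline.
source_tb = 10                       # size of the raw source extract, in TB
layers = ["bronze", "silver", "gold"]
storage_usd_per_tb_month = 23.0      # illustrative object-storage price
compute_usd_per_tb_processed = 5.0   # illustrative scan/processing price
rebuilds_per_month = 30              # e.g., one full rebuild per day

# Each layer holds a full copy, and each rebuild re-processes every layer.
storage_cost = source_tb * len(layers) * storage_usd_per_tb_month
compute_cost = (source_tb * len(layers)
                * compute_usd_per_tb_processed * rebuilds_per_month)

print(f"storage: ~${storage_cost:,.0f}/month")  # ~$690/month
print(f"compute: ~${compute_cost:,.0f}/month")  # ~$4,500/month
```

And that is the bill for a single consumer’s pipeline.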
Costs balloon further when combined with consumer-centric responsibility for data access. A consumer who is unsure which data they can or can’t use is inclined to build yet another pipeline from the source, increasing the total cost in both compute and human resources. The result may be a gold layer project that fails to provide an adequate return on investment, simply because of the high costs this pattern incurs.
Flaw #3: Restoring data quality is difficult
Suppose you’re successful at ETLing the data out of your source system and into your data lake. Now you need to denormalize it, restructure it, standardize it, remodel it, and make sense of it, all without making any mistakes.
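A small, hypothetical pandas example shows why this is so error-prone: every “cleaning” step encodes a guess about semantics that only the producer actually knows. The columns and values below are invented for illustration:

```python
import pandas as pd

# Hypothetical bronze extract: the consumer must reverse-engineer meaning
# the source system never exported (encodings, units, soft-delete flags).
bronze = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "status":   ["shipped", "SHIPPED", "cancelled", "shipped"],
    "amount":   [19.99, 19.99, 5.00, 42.50],  # dollars or cents? per line or per order?
    "deleted":  [0, 0, 1, 0],                 # soft delete? return? test record?
})

silver = (
    bronze[bronze["deleted"] == 0]                        # guess: 1 means soft-deleted
    .assign(status=lambda df: df["status"].str.lower())   # guess: casing is noise
    .drop_duplicates(subset=["order_id"])                 # guess: duplicates are retries
)
print(silver)
# If any guess is wrong (say, deleted=1 actually marks a return), every
# downstream gold data set silently inherits the mistake.
```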
A formerly popular TV show in the United States, entitled Crime Scene Investigation (CSI), gave their audience