Observability 2.0 and the Database for It by todsacerdoti

Observability 2.0 is a concept introduced by Charity Majors of Honeycomb, though she later expressed reservations about labeling it as such (follow-up).

Despite its contested naming, Observability 2.0 represents an evolution from the foundational “three pillars” of observability (metrics, logs, and traces), which have dominated the field for nearly a decade. Rather than treating these as three separate signals, it emphasizes a single source of truth as the data foundation of observability. This approach prioritizes high-cardinality, wide-event datasets over traditional siloed telemetry, aiming to address modern system complexity more effectively.

What is Observability 2.0 and Wide Events

For years, observability has relied on the three pillars of metrics, logs, and traces. These pillars spawned countless libraries, tools, and standards—including OpenTelemetry, one of the most successful cloud-native projects, which is built entirely on this paradigm. However, as systems grow in complexity, the limitations of this approach become evident.

The Downsides of Traditional Observability

Data silos: Metrics, logs, and traces are often stored separately, leading to uncorrelated, or even inconsistent, data without meticulous management.
Pre-aggregation trade-offs: Pre-aggregated metrics (counters, summaries, histograms) were originally designed to reduce storage costs and improve performance by sacrificing granularity. However, the rigid structure of time-series data limits the depth of contextual information, forcing teams to generate millions of distinct time-series to capture necessary details. Ironically, this practice now incurs exponentially higher storage and computational costs—directly contradicting the approach’s original purpose.
Unstructured logs: While logs inherently contain structured data, extracting meaning requires intensive parsing, indexing, and computational effort.
Static instrumentation: Tools rely on predefined queries and thresholds, limiting detection to ‘known knowns’. Adapting observability requires code changes, forcing it to align with slow software development cycles.
Redundant data: Identical information is duplicated across metrics, logs, and traces, wasting storage and increasing overhead.
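
To make the pre-aggregation trade-off above concrete, here is a rough back-of-the-envelope sketch. The label names and cardinalities are made-up assumptions, not figures from any real system; the only point is that the number of distinct time series for one metric is bounded by the product of its label cardinalities.

```python
from math import prod

# Hypothetical label cardinalities for a single HTTP request counter.
# These numbers are illustrative assumptions, not measurements.
label_cardinalities = {
    "path": 50,
    "method": 5,
    "status_code": 10,
    "region": 20,
}

def max_series(cardinalities):
    """Upper bound on distinct time series: the product of label cardinalities."""
    return prod(cardinalities.values())

base = max_series(label_cardinalities)

# Adding one high-cardinality label (here a hypothetical customer ID)
# multiplies the bound by that label's cardinality.
with_customer = max_series({**label_cardinalities, "customer_id": 10_000})

print(base)           # 50000
print(with_customer)  # 500000000
```

In practice not every label combination occurs, so real counts are lower than this bound, but the multiplicative growth is what pushes teams toward millions of series.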

Wide Events: The Approach of Observability 2.0

Observability 2.0 addresses these issues by adopting wide events as its foundational data structure. Instead of precomputing metrics or structuring logs upfront, it preserves raw, high-fidelity event data as the single source of truth. This allows teams to perform exploratory analysis retroactively, deriving metrics, logs, and traces dynamically from the original dataset.

Boris Tane, in his article Observability Wide Event 101, defines a wide event as a context-rich, high-dimensional, and high-cardinality record. For example, a single wide event might include:

{
  "method": "POST",
  "path": "/articles",
  "service": "articles",
  "outcome": "ok",
  "status_code": 201,
  "duration": 268,
  "requestId": "8bfdf7ecdd485694",
  "timestamp":"2024-09-08 06:14:05.680",
  "message": "Article created",
  "commit_hash": "690de31f245eb4f2160643e0dbb5304179a1cdd3",
  "user": {
    "id": "fdc4ddd4-8b30-4ee9-83aa-abd2e59e9603",
    "activated": true,
    "subscription": {
      "id": "1aeb233c-1572-4f54-bd10-837c7d34b2d3",
      "trial": true,
      "plan": "free",
      "expiration": "2024-09-16 14:16:37.980",
      "created": "2024-08-16 14:16:37.980",
      "updated": "2024-08-16 14:16:37.980"
    },
    "created": "2024-08-16 14:16:37.980",
    "updated": "2024-08-16 14:16:37.980"
  },
  "article": {
    "id": "f8d4d21c-f1fd-48b9-a4ce-285c263170cc",
    "title": "Test Blog Post",
    "ownerId": "fdc4ddd4-8b30-4ee9-83aa-abd2e59e9603",
    "published": false,
    "created": "2024-09-08 06:14:05.460",
    "updated": "2024-09-08 06:14:05.460"
  },
  "db": {
    "query": "INSERT INTO articles (id, title, content, owner_id, published, created, updated) VALUES ($1, $2, $3, $4, $5, $6, $7);",
    "parameters": {
      "$1": "f8d4d21c-f1fd-48b9-a4ce-285c263170cc",
      "$2": "Test Blog Post",
      "$3": "******",
      "$4": "fdc4ddd4-8b30-4ee9-83aa-abd2e59e9603",
      "$5": false,
      "$6": "2024-09-08 06:14:05.460",
      "$7": "2024-09-08 06:14:05.460"
    }
  },
  "cache": {
    "operation": "write",
    "key": "f8d4d21c-f1fd-48b9-a4ce-285c263170cc",
    "value": "{\"article\":{\"id\":\"f8d4d21c-f1fd-48b9-a4ce-285c263170cc\",\"title\":\"Test Blog Post\"..."
  },
  "headers": {
    "accept-encoding": "gzip, br",
    "cf-connecting-ip": "*****",
    "connection": "Keep-Alive",
    "content-length": "1963",
    "content-type": "application/json",
    "host": "website.com",
    "url": "https://website.com/articles",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Authorization": "********",
    "x-forwarded-proto": "https",
    "x-real-ip": "******"
  }
}
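
One way to picture the "high-dimensional" part is to flatten such a nested event into one column per leaf field. The sketch below does that for an abbreviated copy of the event above. The dotted column naming is an assumption made for illustration; the article does not prescribe a storage layout.

```python
def flatten(event, prefix=""):
    """Flatten nested dicts into dotted column names, e.g. user.subscription.plan."""
    columns = {}
    for key, value in event.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            columns.update(flatten(value, prefix=f"{name}."))
        else:
            columns[name] = value
    return columns

# Abbreviated version of the wide event shown above.
event = {
    "method": "POST",
    "path": "/articles",
    "status_code": 201,
    "duration": 268,
    "user": {
        "id": "fdc4ddd4-8b30-4ee9-83aa-abd2e59e9603",
        "subscription": {"plan": "free", "trial": True},
    },
}

columns = flatten(event)
for name, value in columns.items():
    print(name, "=", value)
```

Every leaf becomes a dimension that can later be filtered or grouped on, which is why the full event easily reaches dozens of columns.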

Wide events contain significantly more contextual data than traditional structured logs, capturing comprehensive application state details. When stored in an observability data store, these events serve as a raw dataset from which teams can compute any conventional metric post hoc. For instance:

QPS (queries per second) for a specific API path.
Response rate distributions by HTTP status code.
Error rates filtered by user region or device type.

This process requires no code changes: metrics are derived directly from the raw event data through queries, eliminating the need for pre-aggregation or prior instrumentation.
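
As a minimal sketch of what deriving metrics after the fact can look like, the snippet below computes three of them from a handful of in-memory events. The sample events and function names are hypothetical, and a real system would run equivalent queries inside the observability data store rather than in application code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample events, reduced to a few fields from the example above.
events = [
    {"timestamp": "2024-09-08 06:14:05.680", "path": "/articles", "status_code": 201},
    {"timestamp": "2024-09-08 06:14:05.910", "path": "/articles", "status_code": 500},
    {"timestamp": "2024-09-08 06:14:06.120", "path": "/articles", "status_code": 201},
    {"timestamp": "2024-09-08 06:14:06.450", "path": "/login", "status_code": 200},
]

def parse_ts(ts):
    """Parse the timestamp format used in the example event."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")

def requests_per_second(events, path):
    """Count requests for one path, bucketed by whole second."""
    buckets = Counter()
    for e in events:
        if e["path"] == path:
            buckets[parse_ts(e["timestamp"]).replace(microsecond=0)] += 1
    return buckets

def status_distribution(events):
    """Count responses per HTTP status code."""
    return Counter(e["status_code"] for e in events)

def error_rate(events):
    """Fraction of events with a 5xx status code."""
    errors = sum(1 for e in events if e["status_code"] >= 500)
    return errors / len(events) if events else 0.0

print(requests_per_second(events, "/articles"))
print(status_distribution(events))
print(error_rate(events))
```

None of these functions required the events to be instrumented differently; a new question only needs a new query over the same records.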

For practical implementations, see:

Challenges of Observability 2.0 Adoption

Traditional metrics and logs were designed to prioritize resource efficiency: minimizing compute and storage costs. For example, Prometheus employs a semi-key-value data model to optimize time-series storage, akin to the early NoSQL era: just as developers moved relational database workloads to Redis (counters, sorted sets, lists) for speed and simplicity, observability tools adopted pre-aggregated metrics and logs to reduce overhead.
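
As a rough illustration of that key-value flavor: in Prometheus, a time series is identified by a metric name plus a set of label pairs, and each sample is a timestamped numeric value. The sketch below mimics the idea with a plain dictionary. It is not the Prometheus client API, and the metric name, label choice, and second request ID are hypothetical.

```python
from collections import defaultdict

# series key -> running counter value
counters = defaultdict(float)

def series_key(name, labels):
    """Identify a series by metric name plus its sorted label pairs."""
    return (name, tuple(sorted(labels.items())))

def observe_request(event):
    """Pre-aggregate one request into a counter.

    Only the fields chosen as labels survive; everything else in the
    event (request ID, duration, user, ...) is discarded at write time.
    """
    labels = {
        "method": event["method"],
        "path": event["path"],
        "status_code": str(event["status_code"]),
    }
    counters[series_key("http_requests_total", labels)] += 1

observe_request({"method": "POST", "path": "/articles", "status_code": 201,
                 "requestId": "8bfdf7ecdd485694", "duration": 268})
observe_request({"method": "POST", "path": "/articles", "status_code": 201,
                 "requestId": "hypothetical-second-id", "duration": 301})

for key, value in counters.items():
    print(key, value)
```

Both requests collapse into a single series with a value of 2, which is cheap to store but can no longer answer questions about an individual request.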

However, like the shift to Big Data in software engineering, the move from the “three pillars” to wide events reflects a growing need for raw, granular data over precomputed summaries. This transition introduces key challenges:

Event generation: Lack of mature frameworks to instrument applications and emit standardized, context-rich wide events.
Data transport: Efficiently streaming high-volume event data without bottlenecks or latency.
Cost-effective storage: Storing terabytes of raw, high-cardinality data affordably while retaining query performance.
Query flexibility: Enabling ad-hoc analysis across arbitrary dimensions (e.g., user attributes, request paths) without predefining schemas.
Tooling integration: Leveraging existing tools (e.g., dashboards, alerts) by derivin

teleforce

Posted April 25, 2025 at 4:03 am

> We believe raw data based approach will transform how we use observability data and extract value from it.

Perhaps we need a generic database framework that properly and seamlessly caters for both raw and cooked (processed) data for observability, something similar to D4M [1].

[1] D4M: Dynamic Distributed Dimensional Data Model:

https://www.mit.edu/~kepner/D4M/

zaptheimpaler

Posted April 25, 2025 at 4:43 am

At my company we seem to have moved a little in the opposite direction of observability 2.0. We moved away from the paid observability tools to something built on OSS with the usual split between metrics, logs and traces. It seems to be mostly for cost reasons. The sheer amount of observability data you can collect in wide events grows incredibly fast and most of it ends up never being read. It sucks but I imagine most companies do the same over time?

awoimbee

Posted April 25, 2025 at 6:13 am

It looks like what the grafana stack does but it's linking specialized tools instead of building one big tool (eg linking traces [0]).

The only thing then is that there is no link between logs and metrics, but I guess since they created alloy [1] they could make it so logs and metrics labels match, so we could select/see both at once ?

Oh ok here's a blog post from 2020 saying exactly this: https://grafana.com/blog/2020/03/31/how-to-successfully-corr…

[0]: https://grafana.com/docs/grafana/latest/datasources/tempo/tr…
[1]: https://grafana.com/docs/alloy/latest/

jillesvangurp

Posted April 25, 2025 at 6:48 am

Opensearch and Elasticsearch do most/all of what this proposes. And then some.

The mistake many teams make is to worry about storage but not querying. Storing data is the easy part. Querying is the hard part. Some columnar data format stored in S3 doesn't solve querying. You need to have some system that loads all those files, creates indices or performs some map reduce logic to get answers out of those files. If you get this wrong, stuff gets really expensive quickly.

What you indeed want is a database (probably a columnar one) that provides fast access and that can query across your data efficiently at scale. That's not observability 2.0 but observability 101. Without that, you have no observability. You just have a lot of data that is hard to query and that provides no observability unless you somehow manage to solve that. Yahoo figured that out 20 years or so ago when they created Hadoop, HDFS, and all the rest.

The article is right to call out the fragmented landscape here. Many products only provide partial/simplistic solutions and they don't integrate well with each other.

I started out doing some of this stuff more than 10 years ago using Elasticsearch and Kibana. Grafana was a fork that hadn't happened yet. This combination is still a good solution for logging, metrics, and traces. These days, Opensearch (the Elasticsearch fork) is a good alternative. Basically the blob of json used in the article with a nice mapping would work fine in either. That's more or less what I did around 2014.

Create a data stream, define some life cycle policies (data retention, rollups, archive/delete, etc.), and start sending data. Both Opensearch and Elasticsearch have stateless versions now that store in S3 (or similar bucket based storage). Exactly like the article proposes. I'd recommend going with Elasticsearch. It's a bit richer in features. But Opensearch will do the job.

This is not the only solution in this space but it works well enough.

arkh

Posted April 25, 2025 at 7:02 am

After reading this post I'm left wondering: you want to capture events. You want to have different views of them. Why don't you use Kafka and create a consumer per "view"?

QuiCasseRien

Posted April 25, 2025 at 7:09 am

just one word : uptrace, https://uptrace.dev/

a very satisfied user : trace, metrics, log in a perfect way

sunng

Posted April 25, 2025 at 7:22 am

Author here. Thanks @todsacerdoti for posting this.

I am a big fan of keeping as much of the original data and context as possible. With previous metrics systems, we lost too much information to pre-aggregation and eventually ran into the high-cardinality metrics issue by overloading the labels. For teams that own hundreds of millions to billions of time series, this o11y 2.0/wide event approach is really worth it. And we are determined to build an open-source database that can deal with the challenges of wide events for users ranging from small teams to large organizations.

Of course, the database is not the only issue. We need full tooling, from instrumentation to data transport. We already have the opentelemetry-arrow project for larger-scale transmission, which may work for wide events. We will continue to work in this ecosystem.

fuzzy2

Posted April 25, 2025 at 7:30 am

This article leaves me confused. The “wide event” example presented is a mishmash of all the different concerns involved with a business operation: HTTP request, SQL query, business objects, caches, …. How is this any better than collecting most of this information as separate events on a technical level (with minimal, if any, code changes: interceptors, middleware etc) and then aggregating afterwards?

From my perspective, this is just structured logging. It doesn’t cover tracing and metrics, at all.

> This process requires no code changes—metric are derived directly from the raw event data through queries, eliminating the need for pre-aggregation or prior instrumentation.

“requires no code changes”? Well certainly, because by the time you send events like that your code has already bent over backwards to enable them.

Surely I must be missing something.

Drahflow

Posted April 25, 2025 at 8:24 am

The point that the trinity of logs, metrics and traces wastes a lot of engineering effort on pre-selecting the right metrics (and labels), and a lot of storage (by keeping much of the information in triplicate), is a good one.

> We believe raw data based approach will transform how we use observability data and extract value from it.

Yep. We have built quuxLogging on the same premise, but with more emphasis on "raw": Instead of parsing events (wide or not), we treat it fundamentally as a very large set of (usually text) lines and optimized hard on the querying-lots-of-text part. Basically a horizontally scaled (extremely fast) regex engine with data aggregation support.

Having a decent way to get metrics from logs ad-hoc completely solves the metric cardinality explosion.

rlupi

Posted April 25, 2025 at 9:09 am

Almost 8 years ago, when I was working as a Monitoring SRE at Google, I wrote a proposal to use compressed sensing to reduce capture, storage and transmission costs from linear to logarithmic. (The proposal is also available publicly, as a defensive publication, after lawyers complicated it beyond recognition https://www.tdcommons.org/dpubs_series/954/)

I believe it should be possible now, with AI, to train online tiny models of how systems behave in production and then ship those models to the edge to use to compress wide-event and metrics data. Capturing higher-level behavior can also be very powerful for anomaly and outlier detection.

For systems that can afford the compute cost (I/O or network bound), this approach may be useful.

This approach should work particularly well for mobile observability.
