It is no secret that I am not a fan of logs; I’ve baited (rapala in Finnish) discussion in our work chat with things like:
If you’re writing log statements, you’re doing it wrong.
This is a pretty incendiary statement, and while there has been some good discussion afterwards, I figured it was time to write down why I think logs are bad, why tracing should be used instead, and how we get from one to the other.
Hopefully, with less clickbait. Step 3 will shock you, though.
Logs vs Traces
First, let’s break down what I see as the key differences between logging and tracing code. If you want the practical example and want to skip this wall of text, click here. There is also a short question and answer section at the end.
Log Levels
Log Levels are meaningless. Is a log line `debug`, `info`, `warning`, `error`, `fatal`, or some other shade in between?
The only time I have seen this well managed was when we had 3 descriptions:
- Info (everything)
- Create A Task for Later (e.g. this can wait for working hours)
- Wake Someone Up (this is on fire).
However, this has issues: a single timeout is “for later” or just `info`, but many timeouts might be “wake someone up”; where do you encode that logic?
There is no equivalent of this in a trace, so there’s one less thing to worry about.
Messages
Something like `found ${count} users in the cache` looks like a good log message. It is clear and has the information you care about. However, when you come to query your log aggregation system, you need to do a free-text search on a substring of the message or use a wildcard search. Both of these operations are slow, and worse still, they cannot help you with the negative version of the question: “find me operations where the cache wasn’t checked”.
Assuming structured logs are being used, we can at least add the `count` as a property to the log message so it can be filtered on by the log aggregator. At which point, why do you need the log message at all? Why not just have `log({ users_in_cache: count })`?
The logging libraries also put more importance on the message than the properties by having it be the first (and only required) argument to the logging function. From a querying and analysis perspective, this is backwards: messages push you towards free-text searching and away from fast queries on attributes.
Given this log line:
logger.info("found the user in the cache", userId);
You could reconstruct this as a statement that can be filtered on:
span.addAttributes({
user_in_cache: true,
user_id: userId,
})
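For a concrete sketch of what that can look like with the OpenTelemetry JavaScript API (where the method is `setAttributes`, and assuming a span is already active for the current request; the attribute names are only illustrative):

```typescript
import { trace } from "@opentelemetry/api";

// Attach the facts as attributes to the currently active span.
// `userId` comes from the surrounding request handling, as in the
// logging example above.
const span = trace.getActiveSpan();
span?.setAttributes({
  user_in_cache: true,
  user_id: userId,
});
```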
Mixed Outputs / Semantics
Nowadays, people configure their HTTP server to write logs to `stdout` rather than to a log file. This makes a great deal of sense, as now a separate piece of software can tail the logs and send them off to the aggregator. Your application’s output, however, will likely end up with mixed plaintext and structured logs, mostly down to libraries and frameworks doing their own thing with logging, which is probably not the same as what you are doing (or what other libraries are doing).
The second problem with writing logs to `stdout` is that it mixes the semantics of log lines with console output. While both log messages and console output are (probably) just text, they have different purposes. A piece of feedback might be “Server listening on port 3000. Click here to open the browser”. This is useful to a local user running the app but isn’t valuable in your logs. Not to mention, it’s plaintext output on the console, rather than structured, so now your log tailer needs to figure out what to do with it.
With OpenTelemetry, you instead configure an `exporter` to handle all your traces (and logs and metrics, if you desire). This is typically sent in OTLP format, either directly to a vendor or to an OTEL Collector instance, which can add/remove/modify the data and send it on to one or multiple places.
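As a rough sketch, the wiring with the OpenTelemetry Node SDK might look like the following, assuming a Collector (or vendor endpoint) is listening on localhost:4318:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Send all spans as OTLP over HTTP to a local OTEL Collector, which can
// then fan the data out to one or more backends.
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces",
  }),
});

sdk.start();
```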
Now you are free to write whatever feedback you want to `stdout`.
Relationships
Log lines don’t have causal relationships. At best, structured logs might have some kind of request identifier (such as `requestID` or `correlationID`) attached to all lines written during a request. This allows you to find all the log lines for a given ID, but the log lines themselves don’t have a fixed order. Ordering in structured logging relies on the timestamp field, which is limited to the accuracy and resolution of the time source, so lines can appear to have happened at the same time when they actually happened at different times.
Traces come with automatic parent-child relationships, allowing us to see not only all spans in a single request, but what caused each span, and (as we’ll get to in the next point) when they happened.
Tracing also takes these relationships a step further by having a standard set of formats for passing trace and span IDs along to other services, embedded in HTTP headers, message queue metadata, or other locations. Assuming all your other services send their traces to the same system, you can start to visualise who calls your service and what the effects are of you calling other services.
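As a sketch of what that propagation looks like with the OpenTelemetry JS API (the downstream URL is made up; with the default W3C propagator this produces a `traceparent` header of the form `00-<trace-id>-<span-id>-<flags>`):

```typescript
import { context, propagation } from "@opentelemetry/api";

// Inject the active trace context into a header carrier; the downstream
// service can pick the `traceparent` header up and continue the same trace.
const headers: Record<string, string> = {};
propagation.inject(context.active(), headers);

await fetch("https://downstream.example.com/users", { headers });
```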
Imagine opening a trace from your service and discovering another team in a downstream service has started tracing their stuff, and you suddenly can see even more information about the request. You didn’t even do anything! You notice that the way your service is structured causes a lot of load on the downstream, and start a conversation to see if there is a way to make it faster/better.
Timings
The only timing data you get on log lines automatically is the `timestamp` of when the log line was written. When you want to see how long an operation took, you must start a timer, stop the timer, and log the elapsed duration yourself. As this is a manual process, it is often not done, and when it is done, it tends to be inconsistent across the application in timing source (and thus resolution), property name, and format. Is it `elapsed`, `elapsed_ms`, `duration`, or `length`? Does it contain a number of seconds, milliseconds, nanoseconds, or a timestamp?
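The hand-rolled version usually ends up looking something like this sketch, with whatever property name and unit the author happened to pick (`fetchUsers` is a hypothetical helper):

```typescript
// One team's hand-rolled timing; another team might log `duration` in
// seconds, or `elapsed` as a formatted string.
const start = Date.now();
const users = await fetchUsers();
logger.info("fetched users", { elapsed_ms: Date.now() - start });
```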
By comparison, traces come with `startTime`, `finishTime`, and `duration` attributes, which are not only guaranteed to be there, but are set from the same timing source and are always written in the same format.
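A minimal sketch of the same operation wrapped in a span, using the OpenTelemetry JS tracer (the span and helper names are made up); both timestamps come from the same clock without any extra code:

```typescript
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("users-service");

// startActiveSpan records the start time when the span is created, and
// span.end() records the finish time; the duration is derived from both.
await tracer.startActiveSpan("fetch-users", async (span) => {
  try {
    return await fetchUsers(); // hypothetical helper
  } finally {
    span.end();
  }
});
```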
Combine this with the relationship attributes, and you can now render timing graphs, allowing for easy visualisation of how long each part of a process takes, what can be parallelised, and what parts depend on other parts.
For example, this is a graph showing how long a CI job took to run, showing all the different steps, their durations, and child steps:
Querying
Querying can have wildly different performance characteristics depending on which log aggregation service you are using. What unifies all these systems is their slowness, which is mostly down to the vast amounts of data that need indexing so they can be free-text searched. Filtering logs by their structured properties is, however, usually pretty quick.
Where querying falls down is in trying to find trends in data and in answering negative queries. For example, how would you search for “all requests to x endpoint, where users were not found in the cache”? This requires you to group the logs by request ID, then find an entry with a particular URL path, then see if it is missing a log line. The same query in a tracing system would be `where path = "/some/api" && !user_in_cache`, as the tracing system is already aware of all the spans in a trace and does the grouping automagically.
Finally, visualising missing data is hard. Take this small example: it shows 4 parallel requests to a system, and one of them is missing a log line. Which request is missing a line?
| Timestamp | UserID | Message |
|---|---|---|
| 12:51:27 | 3fcce385be9e | fetched 3rd party preferences |
| 12:51:27 | 3fcce385be9e | found user in cache |
| 12:51:27 | 915d273db25c | fetched 3rd party preferences |
| 12:51:27 | 3fcce385be9e | saved successfully |
| 12:51:27 | 8507d369d11c | fetched 3rd party preferences |
| 12:51:27 | c4e71b4a29f2 | fetched 3rd party preferences |
| 12:51:27 | 915d273db25c | saved successfully |
| 12:51:27 | c4e71b4a29f2 | found user in cache |
| 12:51:27 | c4e71b4a29f2 | saved successfully |
| 12:51:27 | 8507d369d11c | found user in cache |
| 12:51:27 | 8507d369d11c | saved successfully |
Is it easier to see the one that is different now?
| Timestamp | UserID | Fetched | In Cache | Saved |
|---|---|---|---|---|
| 12:51:27 | 3fcce385be9e | true | true | true |
| 12:51:27 | 915d273db25c | true | false | true |
| 12:51:27 | 8507d369d11c | true | true | true |
| 12:51:27 | c4e71b4a29f2 | true | true | true |
Not only is it easy to see at a glance that `915d273db25c` didn’t find the user in the cache, but also how much less space this takes up (both visually and in terms of storage).
We can also then use this to query further: show me all traces where `in_cache != true`, and see what’s different about them.
Evolving logs
So, with all that being