Anyone who has maintained software for a while knows that it tends to rot over time. It takes deliberate effort to prevent that from happening. In this post I will tell the story of how one team successfully dealt with it and conclude with some practical tips.
The phenomenon known as bit rot or software entropy has several symptoms:
- Decreasing MTBF (mean time between failures): the software fails more often and there are more and more incidents.
- Increasing LT (lead time): for features that have similar user value, the time it takes for implementation, review, deployment, and release increases over time.
- Decreased efficiency: the ratio of value to effort drops.
- Increasing TTR (time to repair or remedy): it takes longer to fix a software defect (repair) and to ensure it does not happen again (remedy). (see my article on InfoQ about MTT metrics)
- Increasing TTFC (time to first commit): one of several metrics that aim to measure the effectiveness of onboarding a new person to the codebase.
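To make two of these metrics concrete, here is a minimal sketch of how MTBF and TTR could be computed from an incident log. The `Incident` tuple shape, the function names, and the sample timestamps are assumptions made up for illustration, and exact definitions vary between teams: here MTBF is taken as the mean uptime between one incident being resolved and the next one starting, and TTR as the mean time from start to resolution.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident record: (started_at, resolved_at)
Incident = tuple[datetime, datetime]


def mtbf(incidents: list[Incident]) -> timedelta:
    """Mean uptime between one incident's resolution and the next incident's start."""
    ordered = sorted(incidents)
    gaps = [
        (nxt[0] - prev[1]).total_seconds()
        for prev, nxt in zip(ordered, ordered[1:])
    ]
    return timedelta(seconds=mean(gaps))


def mean_ttr(incidents: list[Incident]) -> timedelta:
    """Mean time from the start of an incident to its resolution."""
    durations = [(end - start).total_seconds() for start, end in incidents]
    return timedelta(seconds=mean(durations))


# Made-up sample data, not taken from any real system
incidents = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 2, 0)),
    (datetime(2024, 1, 11, 2, 0), datetime(2024, 1, 11, 6, 0)),
    (datetime(2024, 1, 16, 6, 0), datetime(2024, 1, 16, 12, 0)),
]

print(mtbf(incidents))      # 7 days, 12:00:00
print(mean_ttr(incidents))  # 4:00:00
```

Tracking these two numbers per quarter would show the trend described above: a shrinking MTBF and a growing TTR.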
The root causes are generally:
- External: the runtime, operating system, and dependencies change over time and require the owners to adapt.
- Internal: bugs, config drift, tech debt.
- Hybrid: the requirements and user demands change faster than the team can satisfy them with the code in hand.
Of all those causes, tech debt is one that a development team can control.
I will not bore you with what is already on the internet on the topic:
- Martin Fowler named 4 types of tech debt
- wrote about the pragmatic middle ground back in 2020
- David Pereira wrote about the PM’s take on it
- Devopedia has a page about the history of the term
- Wikipedia has a list of causes
Instead, I will tell a story of one of the most successful ways to deal with it and conclude with some practical tips.
Years ago, I was collaborating with a team of 12 engineers behind two large full-stack applications. Each app had 180K+ SLOC (source lines of code, excluding dependencies, comments, and empty lines but including the tests).
The code itself was the result of platformization of a bespoke solution that was built a few years back. At some point, the company had multiple solutions for solving the same problem. So, it reasonably decided to pick the most mature solution, generalize it to a platform, and assemble a team of A-players to own it.
This gave birth to the inner source monoliths (AKA shared repos) where some 150+ people collaborated.
That’s where I came in. I was an internal transfer from another cluster. Of the TDP trio (tech, domain, people), I was familiar with the tech and the people but relatively new to the domain.
My struggles started from day one. I could not make sense of the codebase and felt frustrated. At the time I had 19 years of programming experience, the last seven of which were specifically on the technology those apps were using.
Almost everyone on the team had less experience than me (at least on paper), yet a simple task would take me multiple days longer than I expected. I felt dumb and helpless.
Fortunately, some of the creators of the original codebase were with the platform team and could give me an intro lasting a grand total of 2 hours. More than helping me understand the code, the intro helped me understand the history, the mentality, and the larger forces at play that shaped the code.
You see, the leadership did not care about the code quality as long as the stories were delivered on time. Corners were cut, tests were skipped, and I kid you not, there was a sign on the wall that read:
I kept my feelings to myself. Obviously, the guy who asked me to join the team (one of the senior directors in that cluster) had other plans. Maybe it was a test to see how I would react? I was new to the team and had to build credibility before I could steer any change. Plus, as I often say: “Understand before trying to change.” For all I knew, the code and people are inseparable. You cannot fix cultural issues with technical solutions.
My first real contribution to the team was to put this witty picture on the wall. It was received well.
If it were today, I would put this on the wall:
Turns out I was not alone in my frustration. Tech debt consistently kept coming up retro after retro until the management decided to take it seriously and do something about it.
So, we had a workshop to drill deeper into this issue: to understand why it was happening and how we could take control. The team had an honest conversation, and my respect for the team grew. Turns out, their stressful days did not leave much time for cleaning up the mess. Who would have thought? To their credit, I came in when the code was like a crumbling Jenga tower:
They were well aware of the issue, but the main problems were a lack of time and a lack of knowledge of best practices.
I am paraphrasing here since it was a few years ago, but if I recall correctly, one developer said:
There is so much tech debt that we should park all regular activities and go fix that for six months.
The PM argued:
But we can’t do that. Who is going to run the product and add new features while we’re paying the tech debt? How about breaking the work into smaller