September 22, 2023

What We Talk About When We Talk About Maintenance

Building software creates a need to maintain that software. This is true regardless of the user base. Chances are: if it is being used, it will need to be maintained. This has become especially true in the day and age of containerized, dependency-heavy applications that, even when functionally static, will likely benefit from frequent dependency updates to avoid critical security vulnerabilities. In my professional career, I’ve heard many names given to the engineering capacity assigned to these tasks, but the most common I’ve encountered is Keeping The Lights On, or KTLO, maintenance.

Defining The Work

In a good engineering culture, KTLO would only consist of work that doesn’t involve adding new functionality to an application. This might seem simple at first: if all the work does is maintain current functionality, then it’s KTLO. But not all engineering cultures are created equal, and not everyone understands the distinction. What follows is my experience of the varieties, and the corruption, of KTLO maintenance.

Basic Functionality As KTLO

Corruption Level: Uncorrupted

This is KTLO work that actually keeps the lights on: work that ensures the functionality of your software stays consistent. In cloud-based environments, this is mostly routine upkeep like dependency version bumps and patching critical security vulnerabilities.

A common pitfall I’ve come across in engineering cultures is a misunderstanding of what this type of KTLO work entails. Dependency version bumps can and will break the functionality of your software when not properly tested. A bad engineering culture likely won’t have proper automated end-to-end/functional testing in place, and management will be confused when a minor version bump of a dependency consumes an appreciable amount of developer capacity. Optimally, automation will replace the need for anyone to manually validate a minor dependency bump within a software stack, but, in lieu of that, management needs to understand that manual validation takes time.
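As a minimal sketch of what that automation might look like (the endpoints, URLs, and use of pytest and requests here are illustrative assumptions, not a description of any particular stack), a small smoke suite run in CI against every dependency-bump branch can catch most of these breaks before a human ever looks at them:

    # smoke_test.py: a minimal pytest smoke suite, run automatically against
    # a deployed branch whenever a dependency version is bumped.
    import os

    import requests

    BASE_URL = os.environ.get("SMOKE_TEST_BASE_URL", "http://localhost:8080")


    def test_service_responds_at_all():
        # Cheapest possible signal: the service starts and answers.
        response = requests.get(f"{BASE_URL}/health", timeout=5)
        assert response.status_code == 200


    def test_critical_path_still_works():
        # Exercise one real, user-visible workflow, not just /health.
        response = requests.get(f"{BASE_URL}/addresses/12345", timeout=5)
        assert response.status_code == 200
        assert "address" in response.json()  # the shape consumers depend on

If a suite like this passes on a minor bump, nobody needs to spend capacity on it; if it fails, the failure is visible before the bump ships.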

Enhancement As KTLO

Corruption Level: Mildly Corrupt

Consider the following scenario:

You manage backend services for a shipping company that allow the retrieval of customer data related to shipping addresses. This consists of an API sitting on top of a database that stores customer address data. In the past, internal customer service representatives have requested the ability to mark a shipping address as “blocked” or inactive, preventing shipments.

Requirement gathering revealed that the customer service department expected to block roughly 100 addresses per week. Your team built a simple application that lets customer service reps send these requests by uploading a file containing customer information to an endpoint. A year later, the customer service department needs to block 100,000+ addresses. They upload the file using your tool and find that this breaks the process. They submit a request to have the expected functionality of the tool restored.
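To make the failure concrete, here is a rough sketch of how an endpoint like that often gets built (the framework, route, and helper names are assumptions for illustration, not the actual system). Every uploaded row is handled synchronously inside a single HTTP request, which is unremarkable at 100 addresses a week and falls over long before 100,000:

    # A hypothetical address-blocking upload endpoint, sketched with Flask.
    import csv
    import io

    from flask import Flask, jsonify, request

    app = Flask(__name__)


    @app.route("/addresses/block", methods=["POST"])
    def block_addresses():
        uploaded = request.files["address_file"]
        reader = csv.DictReader(io.TextIOWrapper(uploaded.stream, encoding="utf-8"))

        blocked = 0
        for row in reader:
            # One synchronous database round trip per row: invisible at 100
            # rows, but at 100,000 rows the request outlives any sensible
            # HTTP timeout and the upload "breaks".
            mark_address_blocked(row["customer_id"], row["address_id"])
            blocked += 1

        return jsonify({"blocked": blocked})


    def mark_address_blocked(customer_id, address_id):
        ...  # stand-in for the real UPDATE against the address table

Restoring the “expected functionality” at the new scale means batching, a background job, or bulk writes: design work, not a bigger input file.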

How do you classify this work?

From an engineer’s perspective, this seems like an obvious instance of an enhancement: the requirements for the software have changed. However, in some engineering cultures, this work can quickly be classified as KTLO. My professional experience has revealed that this is a significant differentiator between positive and negative engineering cultures.

Does the business understand that a three-orders-of-magnitude increase in usage is not always as easy as increasing the size of the input?

There are many engineering/business cultures that would simply consider this KTLO. After all: the functionality isn’t changing; there’s just more work expected. Management that makes this classification will likely find itself concerned with the amount of “KTLO” consuming the team’s capacity.

Recovery and Resiliency Exercises As KTLO

Corruption Level: Moderately Corrupt

This one seems a bit confusing at first.

Generally speaking, KTLO is considered high-priority, low-cognitive-load work. This is not to say that KTLO can be performed passively or without thought. Instead, it means that KTLO should not be difficult work to perform. If you find that developers are consistently holding meetings to discuss and solve issues that are simply “keeping the lights on”, it might be time to re-examine your processes.

That brings us to application recovery and resiliency. For large businesses with SLAs demanding 99.(insert a remarkable number of 9s here)% uptime, this work is considered essential.
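To put “a remarkable amount of 9s” in perspective, here is the downtime budget each tier actually allows; the arithmetic is simple, but nothing about meeting it is:

    # Yearly downtime budget implied by an availability SLA.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for label, availability in [("99%", 0.99), ("99.9%", 0.999),
                                ("99.99%", 0.9999), ("99.999%", 0.99999)]:
        budget_minutes = (1 - availability) * MINUTES_PER_YEAR
        print(f"{label:>8} uptime allows {budget_minutes:8.1f} minutes of downtime per year")

Three nines leaves roughly nine hours a year; five nines leaves about five minutes. Work held to that standard is not low-cognitive-load work.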

Resiliency and recovery for incidents is a notoriously difficult problem to solve.

[Image: GitHub blog post showing an incident report for database issues. Even world-scale companies like GitHub haven’t completely solved this yet.]

Huge companies struggle with it. Small companies struggle with it. A bad engineering and management culture exacerbates the issue. I’ve been on teams where the expectation was that recovery exercises and resiliency efforts were considered part of KTLO maintenance. Without fail, these teams would be questioned on the metrics surrounding the amount of capacity being consumed by these exercises.

Having your application not go up in flames during a server disruption should be considered a feature. Engineering management should want their developers laser-focused on this functionality. Creating an incentive not to focus on this work leads to applications prone to critical failure.

Ad-Hoc Requests As KTLO

Corruption Level: Dangerously Corrupt

I once worked on an engineering team managing backend APIs for internal data. Our ‘customers’ were other developers inside the company. I received a request from the product owner of a customer team asking that we make a significant feature change to our service. I informed them that we would be happy to complete the work in an upcoming sprint and could schedule a meeting for requirement gathering. I also informed them that, if the work was considered critical, we could work with business management to prioritize it in the current sprint.

A day later, I received notice that this work should be considered KTLO. According to the management I discussed the issue with: this was blocking potential business, meaning the lights were off, and this work needed to be completed this sprint in addition to the assigned work.

This is exactly how a team ends up with poorly developed applications and consistent overrun of deadlines. Considering ad-hoc requests as part of KTLO defeats the purpose of categorizing the work altogether.

My team and I would eventually pull together and complete all the work requested for the sprint, but the customer ended up deciding they didn’t need the new feature we had developed after all…

Managing KTLO: Dos and Don’ts

I have worked on engineering teams that have seen significant reductions in KTLO work without impacting application performance or development velocity, and without causing other disruptions. Here is what worked and what didn’t:

Do: Start discussions about KTLO at the developer level, preferably the most junior level

Newer developers who have recently had to learn what goes into maintaining your software stack are likely to have the most insight into the most time-consuming aspects of maintaining your systems. They can tell you what was unintuitive and difficult to onboard with.

More senior developers can describe what can be automated and what major roadblocks appear when managing your software stacks. They will know what work can be done to reduce overhead.

Don’t: Assume KTLO will be reduced “once we build the new system”

I’ve been on engineering teams where entirely new systems were conceived, designed, and built with a major goal being the reduction of KTLO. The problem is: it’s nearly impossible to predict what KTLO work will look like in the design of a new system. It’s likely that the KTLO that you’re struggling with on your existing system was not part of the original design plan. Consider other options before resorting to redesigns.

Do: Test, Test, Test

I am a strong proponent of the idea that consistent end-to-end testing is an exceptionally effective way to reduce the quantity of KTLO. An automated testing suite that can quickly tell developers whether your application still functions as intended is invaluable for validating the small changes that KTLO work entails.
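As one sketch of what that can look like (reusing the hypothetical address-blocking service from earlier; the endpoints, payload shapes, and use of requests are assumptions), a test that exercises a whole workflow and checks the user-visible result can gate every KTLO change in a few seconds:

    # test_block_addresses_e2e.py: end-to-end check of one real workflow,
    # run against a staging deployment before any KTLO change is merged.
    import io
    import os

    import requests

    BASE_URL = os.environ.get("E2E_BASE_URL", "http://localhost:8080")


    def test_blocking_an_address_end_to_end():
        # Upload a one-row file asking for a single address to be blocked.
        upload = io.BytesIO(b"customer_id,address_id\n42,7\n")
        response = requests.post(
            f"{BASE_URL}/addresses/block",
            files={"address_file": ("block.csv", upload, "text/csv")},
            timeout=10,
        )
        assert response.status_code == 200
        assert response.json()["blocked"] == 1

        # Verify the user-visible effect, not just the API's return code.
        address = requests.get(f"{BASE_URL}/customers/42/addresses/7", timeout=10).json()
        assert address["blocked"] is True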

Don’t: Assume SaaS will be a cure-all

I’ve been part of two major migrations to cloud provider managed database systems with the intent of “reducing the KTLO work associated with managing it ourselves”. I’ll admit: cloud-provider managed services do have a lot of great upsides and should definitely be part of your considerations when designing applications. But I have also seen them break in ways that require escalation and support. KTLO that is self-managed is often much cheaper than high-level AWS support for a DB cluster issue.

Do: Realize it takes time

Due to the nature of KTLO work, it’s easy to lose sight of the fact that reducing it can be a major undertaking. Communicate with your developers about how much time these reductions require, and consider whether the tradeoff is worth it for you.