Troubleshooting Distance

How far away are you from understanding the cause of a problem? And how direct a path can you take to get there?

I’ve spent the last five years working with Kubernetes in one way or another, and it’s the best solution to distributed computing problems I’ve seen so far.

I think a large part of that is how it makes the reconciliation loops necessary for running a complex system both declarative and explicit.

They’re declarative because the system generally asks you to declare the state you want it to end up in, not the series of steps to get there.

They’re explicit because the declarative pattern forces you to say exactly what you want.
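As a concrete sketch of what I mean (the Deployment name and image below are made up purely for illustration): the whole interaction is handing the API server the state you want, and the Deployment controller works out the steps to get there.

```bash
# Declare the end state ("three copies of this container, please") rather
# than the steps; the controller reconciles the cluster towards it.
# The name and image are hypothetical, for illustration only.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: example.com/api:1.2
EOF
```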

Where I’ve seen Kubernetes not work well is in how hard it can be to troubleshoot.

I think that, in an ideal world, a Kubernetes object should be able to tell you when something has gone wrong, and what is needed to fix it. And for some objects, this is the case, particularly the core objects like Pod, Deployment, Node and so on.
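For a Pod, for instance, a single command usually gets you there:

```bash
# For the core objects this mostly holds today: one describe surfaces the
# Pod's conditions and the Events that explain why it isn't running.
kubectl describe pod <pod-name>
```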

However, as the scope of things managed by Kubernetes has grown, the number of objects required to record the full state of the system has grown too, and it’s become far more common to need to look at more objects to understand what’s happening.

I’ve spent ages trying to understand what makes some troubleshooting experiences better than others, because if I can understand that, I can hopefully build things that are easier to troubleshoot.

And then I noticed that “how many times do I have to type kubectl get to troubleshoot this” is a reasonable approximation for “how easy is this to troubleshoot”. So I started calling this “kubectls to understanding”.
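A made-up but typical example of what I mean, tracing a broken web app from the outside in (the resource names here are hypothetical):

```bash
# Four kubectls to understanding: each command only tells you which object
# to look at next.
kubectl get ingress web -o wide     # 1: is the Ingress there and admitted?
kubectl get service web             # 2: does the Service it points at exist?
kubectl get endpoints web           # 3: does that Service have any ready endpoints?
kubectl get pods -l app=web         # 4: which Pods are actually unhealthy?
```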

This is a bit of a mouthful, though, so I eventually came up with the phrase “troubleshooting distance” to describe it.

What is the Troubleshooting Distance?

The “troubleshooting distance” of a system is how many times you have to interact with the system to understand what’s happening when something breaks.

For Kubernetes objects, this is approximated pretty well by how many kubectl commands it takes for you to understand the object’s state.

Aside from the fact that it’s quicker to say, I like that the distance metaphor also suggests ways to make the distance smaller. You can surface more information on the object itself, so that you don’t need to travel through lots of intermediate steps to get to the answer; you can build tools that pull together information from otherwise disparate sources; or you can make the system smaller, since a smaller system has a smaller overall phase space and so tends towards a smaller troubleshooting distance.
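As a sketch of the first two of those, assuming an object that publishes the standard condition types (the Deployment named “api” is hypothetical): surfacing a summary on the object itself means one command, or no manual steps at all, answers “is it healthy, and if not, why not”.

```bash
# One command instead of several: read the conditions the controller has
# already summarised onto the object.
kubectl get deployment api \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.message}){"\n"}{end}'

# Or let the tooling do the travelling: block until the condition you care
# about is true, instead of polling intermediate objects by hand.
kubectl wait deployment/api --for=condition=Available --timeout=60s
```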

My initial feeling is that this idea might be more broadly applicable as well: logs, metrics, and traces give you some indication of where you are in the system’s phase space, whereas more integrated observability tools are more like a GPS that finds where you are with less effort. I’ll think about this one more, though.

What does troubleshooting distance mean for Kubernetes objects?

I’ve found that thinking about this when building new Kubernetes objects has really helped me keep the end users of those objects in mind. You can have the most elegant object design in the world, but it will feel terrible to use if those users can’t figure out why it’s broken.

To kick off this new iteration of a personal blog, I’m planning to run through some of the guidelines I’ve set for myself over the last couple of years, in which I’ve spent a large chunk of my working hours building or interacting with Kubernetes CRDs. They’re all about keeping the troubleshooting distance low. Keep an eye out for more posts in the “CRD Design” series here.

Hello World

So I decided it was time to have a place to put stuff that’s too long for Twitter.

My guess is I’ll mostly be writing about cloud-native open source stuff.

(Honestly, this post is just so Hugo will have a page to render at all.)