Dan Dean / Thoughts:

Systems > Tools

"Our home-rolled CI is too slow, let's switch to Jenkins!"

But then Jenkins just runs a bunch of shell scripts that live outside of source control, build artifacts are never reused, and a single build runs npm install four to six times.

"Our users are getting runtime errors – let's adopt Sentry!"

But then nobody ever checks Sentry, nobody is accountable for error metrics, and our users continue to get runtime errors.

"Nobody knows what work is in flight or how long it's going to take to ship a feature. Let's switch to [GitHub Projects/Jira] from [Jira/GitHub Projects]!"

And after the switch, teams still cannot communicate dependencies or estimate delivery timelines.

"[Payment provider] is always going down, let's switch to [other payment provider]!"

But the new payment provider fails in exciting new ways, and the payment error queue is as bad as it's always been.

We often reach for tools to address the failures we encounter when we should start by examining our systems.

Maybe it isn't that our CI is too slow, but that it is performing the wrong tasks at the wrong time, and needs to be broken apart into a system of contextually appropriate workflows that support team velocity?

Maybe there are so many runtime errors in production because nobody is accountable for those metrics? If a team's success takes into account error metrics, then the system of accountability will change how individuals prioritize their time and select their work, leading to trade-offs which will favor fewer runtime errors in production.

It's easy to go through the motions of agile without understanding how all of the pieces fit together into a system with specific characteristics. Maybe the kanban tool we already have would be more useful if our Project Managers and Lead Engineers went through agile training and actually understood the purpose of all of the rituals, and could see meaning in cumulative flow diagrams?

Maybe our current payment provider is fine, and we just have to rethink the system we've built around it so that it is asynchronous, resilient to failure, and retries automatically instead of calling straight through to the third party? Every third-party provider is going to have downtime, and our systems need to be designed with that as an intrinsic characteristic to be managed locally.
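To make that concrete, here is a minimal sketch of what "queue it and retry" could look like, as opposed to calling the provider inline. Everything in it is hypothetical and for illustration only: PaymentJob, ProviderUnavailable, and the charge callable stand in for whatever our real provider client exposes, and the in-memory deque stands in for a durable queue.

```python
from collections import deque
from dataclasses import dataclass


class ProviderUnavailable(Exception):
    """Hypothetical error raised by a provider client when the third party is down."""


@dataclass
class PaymentJob:
    order_id: str
    amount_cents: int
    attempts: int = 0


def backoff_seconds(attempt, base=1.0, cap=60.0):
    """Exponential backoff: base * 2^(attempt - 1), never more than cap."""
    return min(cap, base * 2 ** (attempt - 1))


def process_queue(queue, charge, max_attempts=5, sleep=lambda seconds: None):
    """Drain the queue, re-queueing failed charges instead of failing the caller.

    Returns (succeeded, dead_lettered) lists of order ids. Jobs that exhaust
    max_attempts are dead-lettered so a human can look at them.
    """
    succeeded, dead_lettered = [], []
    while queue:
        job = queue.popleft()
        job.attempts += 1
        try:
            charge(job)
            succeeded.append(job.order_id)
        except ProviderUnavailable:
            if job.attempts >= max_attempts:
                dead_lettered.append(job.order_id)
            else:
                # A real worker would schedule the retry rather than block here.
                sleep(backoff_seconds(job.attempts))
                queue.append(job)
    return succeeded, dead_lettered


def make_flaky_charge(failures_before_success):
    """Build a fake charge function that fails N times per order, then succeeds."""
    calls = {}

    def charge(job):
        calls[job.order_id] = calls.get(job.order_id, 0) + 1
        if calls[job.order_id] <= failures_before_success:
            raise ProviderUnavailable(job.order_id)

    return charge


jobs = deque([PaymentJob("a", 100), PaymentJob("b", 250)])
ok, dead = process_queue(jobs, make_flaky_charge(2))
```

The point is not the specific backoff numbers. The point is that provider downtime becomes a normal, bounded path through our own system, with an explicit place for the payments that truly cannot be completed, and that design works no matter which provider sits behind charge.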

Switching Tools is Expensive 💰💰💰

The idea of shiny new tools without all of our current problems can be enthralling, but switching tools is often counter-productive.

New tools require time and money to adopt (adjacent tools will need refactoring, integrations will need to be migrated without downtime, people will need training, and at the end artifacts of the previous system will almost inevitably persist), and all too often the failing characteristics of the system remain: build tools are still slow, teams continue to ship runtime errors, delivery timelines are a mystery, and somehow the payment error queue is longer than ever.

Requirements → System Design → Appropriate Tools

Image by Seattle Subway

I'm not at all saying that switching tools should be avoided, just that tools alone will not magic away the failings of the systems they operate within.

Transportation systems are a good example: a fleet of sleek self-driving electric cars sounds amazing, but they will never solve traffic congestion or get us closer to Vision Zero, because Teslas and Volts do not address the failings of transportation systems. To address our transportation system's failings we need to focus on the transportation system itself, much like Seattle Subway is doing, and then let that system design guide us in our infrastructure investments.

If our organization's product delivery system has teams working on features with dependencies across teams, then our kanban tool needs to support linking epics across teams. If our technology stack cannot guarantee that runtime errors are not shipped to production, then our system for accountability must include monitoring and measuring the rate of these errors, and evaluating our teams against those metrics as indicators of team and individual success.
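As a rough illustration of the kind of metric such a system might track, here is a small sketch that turns raw error events into a per-team rate. The input shapes are assumptions: error_events is a list with one team name per runtime error, and requests_by_team maps each team to its request count over the same period. A real setup would pull both from whatever monitoring tool is already in place.

```python
from collections import Counter


def error_rates(error_events, requests_by_team):
    """Return runtime errors per 1,000 requests for each team.

    Teams with zero requests are skipped to avoid dividing by zero.
    """
    errors = Counter(error_events)
    return {
        team: 1000 * errors[team] / count
        for team, count in requests_by_team.items()
        if count > 0
    }


rates = error_rates(
    ["checkout", "checkout", "search"],
    {"checkout": 4000, "search": 1000, "profile": 500},
)
```

Normalizing by traffic matters here: a raw error count would penalize the team with the most users rather than the team shipping the most errors.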

We start by understanding our requirements.

Those requirements should lead to a thoughtful system design exploration.

And once we have crafted a system which will support our needs, only then can we select tools which fit into and support that system.