Dan Dean / Project:

Building Safe and Stress-Free Automated Releases

Problem

Our release process was stressful, time-consuming, and error-prone.

This was the process: notify the team in Slack of your intention to release so nobody else would start their own release, merge your pull request, tag master via GitHub Release, copy the GitHub Release URL and paste it into the Customer Relations Slack channel, find the correct "build" job, click through a few screens, click "build" and wait for it to build (about 15 to 25 minutes), deploy the build to dev, find the correct integration test job, trigger it and wait for it to complete (10 to 15 minutes), run integration test failures by QA to make sure they're not blockers (QA may be busy so you might have to wait until someone has time) ...

Are you still with me?

...find the production "deploy" job, click through a few screens, select the correct build artifact from the original job, click submit and wait for the deployment to complete, copy the URL of the build log and paste it into the GitHub Release, do a smoke test in production, give an "all clear" message to the team so they know they can start their own release if they need to.

That's if everything went well. If something went wrong, the process would grow longer and possibly start all over.

Various engineers on the team ran this process up to three or four times per day, and it was error-prone. We could easily click the wrong build job link and send the incorrect build artifact into Production, which happened more than once. The process took so long that engineers would spend half of their day babysitting a release instead of working on product features.

And the team would spend a lot of time coordinating:

[Image: CI before]

Solution

Automate build, testing, and deployment as a side effect of source control interactions and build timers. Get humans out of the process of releasing so they can focus on building the product and participating in peer review.

The new process: Open a pull request, get peer approval, get QA approval, merge, move on!

Every day at 9AM:

  • automation fires up and looks for anything merged since the previous release
  • release notes are extracted from merged pull requests
  • compiled release notes are published to a GitHub Release
  • the published GitHub release triggers a build which runs through unit tests, linting, and integration tests, pausing just before the build artifact is sent to production
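The first three steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual automation: the `PullRequest` record, the function names, and the "Release note:" line convention in the pull request body are all hypothetical, and fetching pull requests from GitHub or publishing the GitHub Release is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical in-memory record of a pull request."""
    number: int
    title: str
    body: str
    merged_at: Optional[datetime]  # None means the pull request was never merged


# Assumed convention for marking a release note inside a pull request body.
NOTE_PREFIX = "release note:"


def merged_since(prs: Iterable[PullRequest], previous_release_at: datetime) -> List[PullRequest]:
    """Return pull requests merged after the previous release, oldest first."""
    merged = [
        pr for pr in prs
        if pr.merged_at is not None and pr.merged_at > previous_release_at
    ]
    return sorted(merged, key=lambda pr: pr.merged_at)


def extract_release_note(pr: PullRequest) -> str:
    """Use the first non-empty 'Release note:' line, falling back to the title."""
    for line in pr.body.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith(NOTE_PREFIX):
            note = stripped[len(NOTE_PREFIX):].strip()
            if note:
                return note
    return pr.title.strip()


def compile_release_notes(prs: Iterable[PullRequest], previous_release_at: datetime) -> Optional[str]:
    """Build the release body, or return None when nothing was merged."""
    merged = merged_since(prs, previous_release_at)
    if not merged:
        return None  # nothing to release today
    return "\n".join(f"- {extract_release_note(pr)} (#{pr.number})" for pr in merged)
```

Returning `None` when nothing has been merged gives the scheduled job a clear signal to skip the day's release rather than publish an empty one.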

When the build is ready, the day's release marshal is notified and they're given a link to the "Proceed" screen. The process is async, so if they're busy they can leave the paused build until their schedule allows. They can hold the build altogether if there are any risky factors (such as a larger ongoing incident). If everything's in order, they click "Proceed" and get back to work.
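The marshal's choices can be modeled as a small state machine. The sketch below is an assumption about how such a gate might be represented, not how the CI tool implements it: the state names, the `Decision` values, and the "resume" action for lifting a hold are hypothetical.

```python
from enum import Enum


class BuildState(Enum):
    PAUSED = "paused"        # built and tested, waiting for the release marshal
    HELD = "held"            # marshal chose not to ship for now
    DEPLOYING = "deploying"  # artifact is being sent to production


class Decision(Enum):
    PROCEED = "proceed"
    HOLD = "hold"
    RESUME = "resume"  # hypothetical: lift a hold and return to the paused gate


# Only these transitions are allowed; anything else is rejected.
_TRANSITIONS = {
    (BuildState.PAUSED, Decision.PROCEED): BuildState.DEPLOYING,
    (BuildState.PAUSED, Decision.HOLD): BuildState.HELD,
    (BuildState.HELD, Decision.RESUME): BuildState.PAUSED,
}


def apply_decision(state: BuildState, decision: Decision) -> BuildState:
    """Return the next build state, or raise ValueError for an invalid move."""
    try:
        return _TRANSITIONS[(state, decision)]
    except KeyError:
        raise ValueError(f"cannot {decision.value} a build that is {state.value}") from None
```

Because a held build can only go back to the paused gate, shipping it still requires an explicit "Proceed" from a human.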

Humans are responsible for writing high quality code, reviewing code from their peers, and merging completed work. Automation is responsible for everything else.

Outcome

[Image: CI after]

  • The release process went from a mountain of manual steps to one (clicking "Proceed")
  • Time to release went down from 45 minutes to 15 minutes, and humans are only involved for about 3 of those minutes
  • Removed all opportunities for user error
  • Reduced stress
  • Increased engineering productivity
  • Changed how other platforms within the organization approach CI