Our release process was stressful, time-consuming, and error-prone.
This was the process: notify the team in Slack of your intention to release so they don't start their own release, notify Merge, tag master via GitHub Release, copy the GitHub Release URL and paste it into the Customer Relations Slack channel so they know what's going out, find the correct "build" job, click through a few screens, click "build" and wait for it to build (about 15-25 minutes), deploy the build to dev, find the correct integration test job, trigger it and wait for it to complete (10-15 minutes), run any test failures by QA to make sure they're not blockers, QA may not be around so you might have to wait until someone has time...
Are you still with me?
...find the production "deploy" job, click through a few screens, select the correct build artifact from the original job, click submit and wait for the deployment to complete, copy the URL of the build log and paste it into the GitHub Release, do a smoke test in production, give an "all clear" message to the team so they know they too can release if they need to.
That's if everything went well. If something went wrong, the process would expand, and possibly start all over.
Various engineers on the team would run through this process up to three or four times per day, and it was error-prone. We could easily click the wrong build job link, sending the incorrect build artifact into Production – it happened more than once. The process took so long that engineers would spend half of their day babysitting a release instead of working on product features.
And the team would spend a lot of time coordinating.
The new process: Open a pull request, get peer approval, get QA approval, merge, move on!
Every day at 9AM:
When the build is ready, the day's release marshal is notified and they're given a link to the "Proceed" screen. The process is async, so if they're busy they can leave the paused build until their schedule allows. They can hold the build altogether if there are any risky factors (such as a larger ongoing incident). If everything's in order, they click "Proceed" and get back to work.
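Here's a minimal sketch of that pause-and-proceed gate in Python. It isn't the real pipeline: the class name, the states, and the ci.example.com link are all made up for illustration, and it assumes a held build can't be proceeded afterwards.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PAUSED = "paused"      # build is ready and waiting on the release marshal
    HELD = "held"          # marshal decided not to ship this build
    DEPLOYED = "deployed"  # marshal clicked "Proceed"


@dataclass
class DailyRelease:
    """Hypothetical model of one day's automated release."""
    build_id: str
    marshal: str
    state: State = State.PAUSED
    log: list = field(default_factory=list)

    def notify_marshal(self) -> str:
        # Hypothetical URL; a real pipeline would link to its own "Proceed" screen.
        link = f"https://ci.example.com/builds/{self.build_id}/proceed"
        self.log.append(f"notified {self.marshal}: {link}")
        return link

    def hold(self, reason: str) -> None:
        # Only a paused build can be held (e.g. during a larger ongoing incident).
        if self.state is not State.PAUSED:
            raise RuntimeError(f"cannot hold a build that is {self.state.value}")
        self.state = State.HELD
        self.log.append(f"held by {self.marshal}: {reason}")

    def proceed(self) -> None:
        # The build waits in PAUSED for as long as the marshal needs;
        # nothing deploys until this is called.
        if self.state is not State.PAUSED:
            raise RuntimeError(f"cannot proceed a build that is {self.state.value}")
        self.state = State.DEPLOYED
        self.log.append(f"deployed by {self.marshal}")


# Usage: the marshal gets a link, finishes what they're doing, then proceeds.
release = DailyRelease(build_id="42", marshal="alex")
release.notify_marshal()
release.proceed()
```

The point of modeling it as a state that simply waits is that nobody has to babysit anything: the only human action left is a single decision, made whenever the marshal has a moment.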
Humans are responsible for writing high-quality code, reviewing code from their peers, and merging completed work. Automation is responsible for everything else.