Our release process was stressful, time-consuming, and error-prone.
This was the process: notify the team in Slack of your intention to release so nobody else would start their own release, merge your pull request, tag master via GitHub Release, copy the GitHub Release URL and paste it into the Customer Relations Slack channel, find the correct "build" job, click through a few screens, click "build" and wait for it to build (about 15-25 minutes), deploy the build to dev, find the correct integration test job, trigger it and wait for it to complete (10-15 minutes), run integration test failures by QA to make sure they're not blockers (QA may be busy so you might have to wait until someone has time) ...
Are you still with me?
...find the production "deploy" job, click through a few screens, select the correct build artifact from the original job, click submit and wait for the deployment to complete, copy the URL of the build log and paste it into the GitHub Release, do a smoke test in production, give an "all clear" message to the team so they know they can start their own release if they need to.
That's if everything went well. If something went wrong, the process would grow longer and possibly start all over.
Various engineers on the team ran this process up to three or four times per day, and it was error-prone. We could easily click the wrong build job link, sending the incorrect build artifact into Production (it happened more than once). The process took so long that engineers would spend half of their day babysitting a release instead of working on product features.
And the team would spend a lot of time coordinating.
The new process: Open a pull request, get peer approval, get QA approval, merge, move on!
Every day at 9AM:
When the build is ready, the day's release marshal is notified and they're given a link to the "Proceed" screen. The process is async, so if they're busy they can leave the paused build until their schedule allows. They can hold the build altogether if there are any risky factors (such as a larger ongoing incident). If everything's in order, they click "Proceed" and get back to work.
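The manual gate described above can be pictured as a small state machine. This is only a minimal sketch, not our actual pipeline: `ReleaseBuild`, `State`, the method names, and the example URL are all hypothetical, and the notification is a stand-in for whatever chat or CI integration delivers the real "Proceed" link. It also assumes a held build stays held, which the real system may not enforce.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    AWAITING_APPROVAL = "awaiting_approval"  # build is ready and paused
    ON_HOLD = "on_hold"                      # marshal decided to hold the build
    DEPLOYED = "deployed"                    # marshal clicked "Proceed"


@dataclass
class ReleaseBuild:
    """Hypothetical model of one daily build waiting at the manual gate."""
    build_id: str
    marshal: str
    state: State = State.AWAITING_APPROVAL
    log: list = field(default_factory=list)

    def notify_marshal(self, proceed_url: str) -> str:
        # Stand-in for a chat notification; the build stays paused afterwards,
        # so the marshal can respond whenever their schedule allows.
        message = f"@{self.marshal}: build {self.build_id} is ready: {proceed_url}"
        self.log.append(message)
        return message

    def hold(self, reason: str) -> None:
        # Assumption: a build can only be held while it is still paused.
        if self.state is not State.AWAITING_APPROVAL:
            raise ValueError(f"cannot hold a build in state {self.state.name}")
        self.state = State.ON_HOLD
        self.log.append(f"held: {reason}")

    def proceed(self) -> None:
        # Assumption: only a paused build can go to production.
        if self.state is not State.AWAITING_APPROVAL:
            raise ValueError(f"cannot proceed from state {self.state.name}")
        self.state = State.DEPLOYED
        self.log.append("deployed to production")


if __name__ == "__main__":
    build = ReleaseBuild(build_id="daily-build", marshal="release-marshal")
    print(build.notify_marshal("https://ci.example.invalid/proceed"))
    build.proceed()
    print(build.state.name)
```

The point of the sketch is that the only human decision left is a single transition out of the paused state: proceed or hold. Everything before and after that transition belongs to automation.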
Humans are responsible for writing high-quality code, reviewing code from their peers, and merging completed work. Automation is responsible for everything else.