One Amazon Employee’s Mistake Caused Massive Server Outage
Amazon has determined that a mistake by one of its employees caused Wednesday’s widespread outage affecting part of Amazon Web Services’ (AWS) S3 system.
The outage affected popular websites and services including Slack, Medium and Quora. Amazon itself was unable to report service status through the AWS Service Health Dashboard for two hours because the dashboard depends on Amazon S3.
Amazon said its S3 team was debugging a billing system issue when one team member entered a command incorrectly while intending to remove only a small number of servers. Instead, “a larger set of servers was removed than intended.”
Due to the “massive growth” of the S3 system in recent years, the process of restarting and validating systems took longer than expected. S3 eventually began operating normally more than four hours after the initial mistake.
The company said that although removing capacity is a key part of its operations, the tool the employee used allowed too much capacity to be removed too quickly.
“We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level,” Amazon said in a statement.
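Amazon has not published how the modified tool works, so the following Python sketch is only an illustration of the two safeguards described in the statement: refusing a removal that would take a subsystem below its minimum capacity, and removing servers in small batches rather than all at once. Every name here (`Subsystem`, `remove_capacity`, `max_per_step`) is hypothetical and not part of any AWS tool or API.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    """Hypothetical model of a subsystem's server pool."""
    name: str
    active_servers: int
    min_required: int  # assumed minimum capacity needed to keep running


class CapacityRemovalError(Exception):
    """Raised when a removal request would breach the minimum capacity."""


def remove_capacity(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> list[int]:
    """Remove `requested` servers in batches of at most `max_per_step`.

    Returns the list of batch sizes that were removed.
    """
    if requested < 0 or max_per_step < 1:
        raise ValueError("requested must be >= 0 and max_per_step must be >= 1")

    # Safeguard 1: reject the whole request before touching anything
    # if it would leave the subsystem under its minimum capacity.
    if subsystem.active_servers - requested < subsystem.min_required:
        raise CapacityRemovalError(
            f"removing {requested} servers would leave {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly, one small batch at a time.
    # A real tool would presumably pause and check health between batches.
    batches = []
    remaining = requested
    while remaining > 0:
        batch = min(max_per_step, remaining)
        subsystem.active_servers -= batch
        remaining -= batch
        batches.append(batch)
    return batches
```

Under these assumptions, a mistyped request for far more servers than intended would be rejected outright instead of being carried out, and a valid request would proceed in small steps.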
Amazon also said it was prioritising work that would ensure S3 subsystems were able to recover faster in the future.