Dozens of websites and apps were knocked offline. Hundreds of thousands of businesses were impacted. Millions of dollars in revenue were lost. One command was entered incorrectly.
That’s the short version of the Amazon S3 Service Disruption that began on February 28th at 1:40 p.m. EST. On March 2nd, Amazon published the long version, including this excerpt:
“…an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended …”
This S3 administrator mistake may sound like an anomaly. One experienced admin who was going by the playbook makes one small mistake that decimates internet functionality for dozens of major websites and apps. In other words, “human error.”
But Amazon’s admin error disaster is not an anomaly. Mistakes happen every day in IT departments around the world, and they happen more often than most companies will admit. Why the silence? 1. Admitting a mistake could damage the company’s reputation, 2. It could invite burdensome compliance mandates, or 3. The company sees no way to prevent such errors, so why cry foul? In fact, these mistakes happen so often that they have their very own slang term: “fat finger errors.” And they can be devastating.
Smart People. Honest Mistakes.
Admin errors aren’t typically the result of incompetence. Smart people can make honest mistakes. However, when it comes to something as critical as their data center environment, organizations must take every precaution to avoid them or suffer the business-crippling financial consequences.
While organizations implement advanced technologies to keep nefarious hackers out of their systems and prevent cyber attacks, many don’t apply the same rigor to preventing internal mistakes that can bring down their systems. They sometimes forget that accidents happen internally, too. But they need to be as prepared to prevent admin mistakes as they are to prevent the next malware, ransomware, or phishing attack.
Amazon outlined how it’s working to fix the mistake the admin made at its Virginia facility. The fix includes:
“… the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level …”
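The two safeguards Amazon describes, a limit on how much capacity one command can remove and a floor below which capacity cannot drop, can be sketched in a few lines. This is only an illustration of the idea, not Amazon's tooling: the `Subsystem` fields, the `remove_capacity` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would violate a safeguard."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int     # hypothetical floor: never drop below this
    max_per_request: int  # hypothetical cap on servers removed per command


def remove_capacity(subsystem: Subsystem, count: int) -> int:
    """Remove `count` servers if both safeguards pass; return servers left."""
    if count <= 0:
        raise ValueError("count must be a positive number of servers")
    # Safeguard 1: remove capacity slowly, a few servers per command.
    if count > subsystem.max_per_request:
        raise CapacityGuardError(
            f"refusing to remove {count} servers at once "
            f"(limit is {subsystem.max_per_request})"
        )
    # Safeguard 2: never take the subsystem below its minimum capacity.
    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_required:
        raise CapacityGuardError(
            f"{subsystem.name} would drop to {remaining} servers "
            f"(minimum is {subsystem.min_required})"
        )
    subsystem.active_servers = remaining
    return remaining
```

With checks like these, a mistyped input such as 50 instead of 5 is rejected before any server is touched, and the subsystem's state is left unchanged.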
However, not everyone is Amazon, and most organizations can’t easily fix big issues like this. But the good news, even the silver lining, of this incident is that safeguards exist today that can prevent and mitigate admin errors.
Prevent Business Disruption — The Two-Person Rule
The two-person rule is an implementation of the adage “trust, but verify.” It has been applied for years in situations and environments where a rogue privileged user acting alone could cause widespread damage. HyTrust’s CloudControl solution enables the same functionality through its Secondary Approval Workflow capability.
The Secondary Approval Workflow is an automated escalation process that requires a designated approver to authorize a sensitive operation attempted by an admin before the operation is allowed to proceed. Secondary Approval Workflow prevents costly disruptions caused by admins, whether accidental or intentional.
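The general two-person pattern behind such a workflow can be sketched as a small gate that holds a sensitive operation until someone other than the requester signs off. This is a generic illustration under assumed names (`ApprovalGate`, `PendingOperation`); it does not represent HyTrust CloudControl's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


class ApprovalError(Exception):
    """Raised when an operation lacks a valid second approval."""


@dataclass
class PendingOperation:
    description: str
    requester: str
    action: Callable[[], str]          # the sensitive operation, not yet run
    approved_by: Optional[str] = None  # set once a second person signs off


class ApprovalGate:
    """Hypothetical gate enforcing the two-person rule."""

    def __init__(self, approvers: Iterable[str]):
        self.approvers = set(approvers)

    def request(self, requester: str, description: str,
                action: Callable[[], str]) -> PendingOperation:
        # Nothing executes here; the operation is only queued for review.
        return PendingOperation(description, requester, action)

    def approve(self, op: PendingOperation, approver: str) -> None:
        if approver not in self.approvers:
            raise ApprovalError(f"{approver} is not a designated approver")
        if approver == op.requester:
            raise ApprovalError("requesters cannot approve their own operation")
        op.approved_by = approver

    def execute(self, op: PendingOperation) -> str:
        if op.approved_by is None:
            raise ApprovalError(f"'{op.description}' has not been approved")
        return op.action()
```

In use, an admin calls `request(...)` with the operation, a designated approver calls `approve(...)`, and only then does `execute(...)` run it. The second reviewer is the point at which a mistyped input has a chance of being caught.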
This is critical today, especially in private and public cloud environments, where admins typically have much greater power than previous generations of admins, who only managed physical data center infrastructure. Now, they can copy, power off, or delete thousands of virtual machines with a few clicks, as was proven at Amazon.
What’s more, incidents that cause widespread damage within organizations occur for reasons beyond just admin error. While this incident was a genuine user mistake, other problems can be lurking that lead to the same outcome: namely, bad actors, both internal and external. In fact, bad actors can intentionally do much more severe and long-lasting damage to organizations. They can disrupt businesses in ways that result in substantial operational downtime, serious compliance violations, or even breaches of confidential data. If this happens, the damage and cost to an organization can be devastating.
Why let a fat-finger error, or a malicious action by a bad actor, bring your systems to a screeching halt when your organization can simply automate the enforcement of Secondary Approval Workflows for those sensitive, high-impact operations that need to be verified before they are executed?
When even the world’s largest cloud provider isn’t immune from this level of disruption, how prepared is your organization to avert the risk of a similar scenario?
Learn more about how HyTrust CloudControl has helped leading enterprises and major government agencies solve this very problem and avert preventable admin mistakes like the one that brought Amazon S3 down. You don’t want your organization to be the next S3. AWS will survive, even thrive, but you may not be so lucky!