A buggy automated audit tool and human error took Facebook offline for six hours. Key lesson for CISOs: Look for single points of failure and hedge your bets.

The longest six hours in Facebook’s history took place on October 4, 2021, as Facebook and its sister properties went dark. The social network suffered a catastrophic outage. The only silver lining, if there is one, is that the outage wasn’t caused by malicious actors. Rather, it was a self-inflicted wound caused by Facebook’s own network engineering team.

Facebook’s first engineering blog post, published on October 4, pointed to the cause: “configuration changes on the backbone routers that coordinated network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”

A follow-up post on October 5 added more detail: “A command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all connections in our backbone network, disconnecting Facebook data centers globally.” The post explained that Facebook has fail-safe processes in place to audit commands like this and prevent such mistakes, but “a bug in that audit tool prevented it from properly stopping the command.”

Yes, yet another instance where the machines turned out to be the insider that caused the havoc.

Impact of a machine-based insider event

When Facebook’s Domain Name System (DNS) servers could no longer reach the data centers, they withdrew their Border Gateway Protocol (BGP) route advertisements, effectively erasing Facebook, Instagram, and WhatsApp from the internet’s map. Neither the company nor the rest of the internet could find them. Because the audit tool had failed to stop the command, the platforms themselves were unreachable. The company couldn’t manage the recovery remotely, so all work had to be handled locally. Imagine the gyrations required to manually bypass all the technological barriers to entry that were in place and were now defaulting to their error status.
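One practical takeaway from that failure mode is the value of reachability monitoring that does not depend on the infrastructure it is watching. The sketch below is a minimal, hypothetical example of such an out-of-band probe; the domain names, port, and timeout are placeholders, and this is in no way Facebook’s monitoring stack. It simply illustrates the two distinct failures seen on October 4: names that no longer resolve, and hosts that resolve but do not answer.

#!/usr/bin/env python3
"""Minimal out-of-band reachability probe (illustrative sketch only).

Run it from infrastructure that does NOT depend on the network being
monitored, so the probe keeps working when that network disappears.
"""
import socket

DOMAINS = ["example.com", "www.example.org"]  # hypothetical targets
TIMEOUT = 5                                   # seconds per connection attempt

def probe(domain: str) -> str:
    # Step 1: does the name still resolve at all?
    try:
        addr = socket.getaddrinfo(domain, 443, proto=socket.IPPROTO_TCP)[0][4][0]
    except socket.gaierror:
        return "DNS FAILURE (name does not resolve)"
    # Step 2: does anything answer on the HTTPS port at the resolved address?
    try:
        with socket.create_connection((addr, 443), timeout=TIMEOUT):
            return f"OK ({addr})"
    except OSError:
        return f"UNREACHABLE (resolved to {addr}, but no answer on port 443)"

if __name__ == "__main__":
    for domain in DOMAINS:
        print(f"{domain}: {probe(domain)}")

During the October 4 outage, a probe like this pointed at Facebook’s domains would have reported the first failure, since the authoritative DNS servers had withdrawn their routes; the point is that such checks must run, and alert, from somewhere the outage cannot reach.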
Additionally, it was widely reported that internal systems riding on the same infrastructure were affected, including internet of things (IoT) devices and services such as access control, company email, and employee online workspaces – all of which are managed in house.

The impact went beyond Facebook’s 3.5 billion users eager to share their photos, opinions, and recipes. Third-party entities that tied their authentication to Facebook had clients, customers, and employees unable to access their accounts. Individual users who had opted to use their Facebook account as their log-in were likewise left twiddling their thumbs waiting for the outage to end, shut out of the sites they wanted to reach because the authentication service behind them was unavailable.

Lessons for CISOs from the Facebook outage

Is this an instance of technical decisions being made by non-technical leaders? Cary Conrad, chief development officer at SilverSky, says the self-inflicted outage is “emblematic of a broader leadership issue in the tech world.” He observes that he has seen for more than 20 years that “Good management trumps good technology every time, yet due to the ever-changing threatscape of the tech industry, inexperienced leadership is oftentimes relied upon for the sake of expediency.” He continues that within the world of cybersecurity, “The Peter Principle is in full effect. People progress to their level of incompetence, meaning a lot of people in leadership within cyber have risen to a level that is difficult for them to execute and often lack formal technical training. As a CISO, there is a need to configure, identify, and negotiate the cost of protecting an organization, and without the adequate experience or a disciplined approach, this mission is executed poorly.”

While the knee-jerk reaction may be to punish the engineer who issued the command, that would be misdirected ire. The real culprit, in this instance, is Facebook’s own architecture, which allowed the network to violate the most basic of network tenets: do not allow for a single point of failure. Facebook’s infrastructure collapsed when the automated audit process failed due to an undetected (or known but not yet mitigated) bug.

Tom Krazit and Joe Williams hit the nail on the head with their summation, published in Protocol, of the three learning opportunities for CISOs that come out of Facebook’s outage:

Plan for the worst. Enterprises need a contingency plan for the complete loss of their computing resources or network connection, not just the loss of a data center or cloud region.

Hedge your bets. It’s extremely unlikely that the entire internet will go down at the same time; hedging at least a few bets across multiple service providers could be worth the effort.

Check your priorities. There’s no way to run an operation the size of Facebook without a serious amount of automation, which means code-auditing tools like the one that failed to stop this outage need extra attention (a hypothetical sketch of such a pre-flight check appears at the end of this article).

October 4 was a bad day for Facebook, and a tweet from Jonathan Zittrain, a professor at both Harvard Law School and the Harvard School of Engineering and Applied Sciences, wryly summarized it: Facebook basically locked its keys in the car.
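As a closing illustration of that last point, here is a minimal, hypothetical sketch of the kind of pre-flight safety check an automated maintenance tool can run before executing a potentially disruptive command. Facebook has not published its audit tool, so the data model, threshold, and function names below are invented; the pattern is simply to simulate the change first and refuse to run anything that would sever every backbone link at once.

#!/usr/bin/env python3
"""Hypothetical pre-flight check for a disruptive maintenance command.

This is not Facebook's audit tool; the data model and threshold are
invented for illustration. The pattern: simulate the change, then
refuse to execute it if it would isolate the data centers entirely.
"""
from dataclasses import dataclass

@dataclass
class BackboneLink:
    name: str
    active: bool

def links_remaining_after(links: list[BackboneLink], links_to_drain: set[str]) -> int:
    """Simulate the command: count the active links that would survive it."""
    return sum(1 for link in links if link.active and link.name not in links_to_drain)

def preflight_check(links: list[BackboneLink], links_to_drain: set[str],
                    minimum_remaining: int = 1) -> None:
    """Raise instead of allowing a change that leaves too few backbone links."""
    remaining = links_remaining_after(links, links_to_drain)
    if remaining < minimum_remaining:
        raise RuntimeError(
            f"Refusing to run: only {remaining} backbone link(s) would remain "
            f"(minimum is {minimum_remaining})."
        )

if __name__ == "__main__":
    backbone = [BackboneLink("dc1-dc2", True), BackboneLink("dc2-dc3", True)]
    try:
        # A capacity-assessment run that mistakenly targets every link:
        preflight_check(backbone, {"dc1-dc2", "dc2-dc3"})
    except RuntimeError as err:
        print(f"Command blocked: {err}")

Facebook’s October 5 post says a check along these lines existed but contained a bug, which is the deeper lesson: the safety net itself needs testing, and ideally a second, independent backstop behind it.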