Facebook outage a prime example of insider threat by machine

A buggy automated audit tool and human error took Facebook offline for six hours. Key lesson for CISOs: Look for single points of failure and hedge your bets.


The longest six hours in Facebook’s history took place on October 4, 2021, when Facebook and its sister properties went dark in a catastrophic outage. The only silver lining, if there is one, is that the outage wasn’t caused by malicious actors. Rather, it was a self-inflicted wound by Facebook’s own network engineering team.

In its first engineering blog post on October 4, Facebook fingered the cause: “configuration changes on the backbone routers that coordinated network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”

Facebook followed up on October 5 with more details: “A command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all connections in our backbone network, disconnecting Facebook data centers globally.” The post explained that its systems have fail-safe processes in place to prevent this type of mistake, but “a bug in that audit tool prevented it from properly stopping the command.”
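The failure pattern Facebook describes, a destructive command that an audit gate was supposed to block but didn’t, can be sketched in a few lines. This is a hypothetical illustration only; the function and command names are invented, not Facebook’s actual tooling.

```python
# Hypothetical sketch of a pre-flight audit gate. A risky network
# command should run only if the audit approves it. The deliberate
# bug below mirrors the outage: the audit fails open, so even a
# command that takes down every backbone connection slips through.

def audit_capacity_command(command):
    """Return True only if the command is judged safe to run."""
    if command == "assess-backbone-capacity":
        return True
    if command == "drain-single-link":
        return True
    # BUG: no explicit rejection of unrecognized commands, so the
    # gate approves everything. The safe behavior is "return False".
    return True


def run_if_audited(command):
    """Execute the command only when the audit gate approves it."""
    if audit_capacity_command(command):
        return "EXECUTED: " + command
    return "BLOCKED: " + command
```

Because the audit fails open rather than closed, `run_if_audited("withdraw-all-backbone-routes")` reports `EXECUTED` when it should have been blocked, which is the essence of what Facebook’s post describes.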

Yes, yet another instance where the machines turned out to be the insider that caused the havoc.

Impact of a machine-based insider event

A Domain Name System (DNS) error caused Facebook’s Border Gateway Protocol (BGP) announcements to essentially go blank: neither Facebook’s own properties (Instagram, WhatsApp) nor the rest of the internet could find them. When the audit tool failed, the platforms themselves became unreachable. The company was unable to operate remotely, so all recovery work had to be managed locally. Imagine the gyrations necessary to manually bypass all the technological barriers to entry that were in place and were now defaulting to their error status.
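The cascade from routing to naming can be modeled with a toy example. This is an illustrative simplification under assumed names (the prefixes and records are invented): once a network’s BGP announcements are withdrawn, its authoritative DNS servers become unreachable, so lookups fail even if the services behind them are healthy.

```python
# Toy model of the outage cascade. The "routing table" stands in for
# the prefixes the rest of the internet can currently reach via BGP.
routing_table = {"facebook-dns-prefix", "facebook-app-prefix"}

# Records held by Facebook's authoritative DNS servers.
dns_records = {"facebook.com": "facebook-app-prefix"}


def resolve(name):
    """Resolve a name, but only if the DNS servers are routable."""
    if "facebook-dns-prefix" not in routing_table:
        return None  # nameservers unreachable: resolution fails
    return dns_records.get(name)


# Before the incident, resolution works normally.
assert resolve("facebook.com") == "facebook-app-prefix"

# The faulty command withdraws every backbone BGP announcement...
routing_table.clear()

# ...and now nobody can even look up the address, let alone connect.
assert resolve("facebook.com") is None
```

The point of the sketch: the DNS records themselves were never lost, but with the routes gone, no resolver could reach the servers that held them.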

Additionally, it was widely reported that various internet of things (IoT) devices and services running on that same internal infrastructure were also affected, including access control, company email, and employee online workspaces – all managed in house.

The impact went beyond Facebook’s 3.5 billion users eager to share their photos, opinions, and recipes. Third-party entities that tied their authentication process to Facebook had clients, customers, and employees unable to access their accounts. Individual users who opted to use their Facebook account as their log-in were likewise left twiddling their thumbs until the outage ended, as access to their desired domains was blocked by the unavailable authentication process.

Lessons for CISOs from the Facebook outage

Is this an instance of technical decisions being made by non-technical leaders? Cary Conrad, chief development officer at SilverSky, says the self-inflicted outage is “emblematic of a broader leadership issue in the tech world.” Over more than 20 years he has observed that “Good management trumps good technology every time, yet due to the ever-changing threatscape of the tech industry, inexperienced leadership is oftentimes relied upon for the sake of expediency.” Within the world of cybersecurity, he continues, “The Peter Principle is in full effect. People progress to their level of incompetence, meaning a lot of people in leadership within cyber have risen to a level that is difficult for them to execute and often lack formal technical training. As a CISO, there is a need to configure, identify, and negotiate the cost of protecting an organization, and without the adequate experience or a disciplined approach, this mission is executed poorly.”

While the knee-jerk reaction may be to punish the engineer who gave the update order, that would be misdirected ire. The real culprit, in this instance, is Facebook’s own architecture. It allowed their network to fail the most basic of network tenets: Do not allow for a single point of failure.

Facebook’s infrastructure collapsed when the automated audit process failed due to an undetected (or known but not yet mitigated) bug.

Tom Krazit and Joe Williams hit the nail on the head with their summation, published in Protocol, of the three learning opportunities for CISOs that come out of Facebook’s outage:

  • Plan for the worst. Enterprises need a contingency plan for the complete loss of their computing resources or network connection, not just the loss of a data center or cloud region.
  • Hedge your bets. It's extremely unlikely that the entire internet will go down at the same time; hedging at least a few bets across multiple service providers could be worth the effort.
  • Check your priorities. There's no way to run an operation the size of Facebook without a serious amount of automation, which means code-auditing tools like the one that failed to stop this outage need extra attention.
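The “hedge your bets” advice above boils down to a simple pattern: depend on a list of independent providers and fail over down the list, rather than pinning everything on one. A minimal sketch, with invented provider names and a stubbed health check standing in for real monitoring:

```python
# Hypothetical failover sketch: return the first provider that passes
# a health check, instead of hard-wiring a single provider.

def first_available(providers, is_up):
    """Return the first healthy provider in priority order, or None."""
    for provider in providers:
        if is_up(provider):
            return provider
    return None


# Simulate a total outage of the primary provider; the health-check
# results here are hard-coded stand-ins for real probes.
status = {"primary-dns": False, "secondary-dns": True}
chosen = first_available(["primary-dns", "secondary-dns"], status.get)
```

With the primary down, `chosen` falls through to `"secondary-dns"`; the same shape applies to authentication providers, cloud regions, or CDNs.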

October 4 was a bad day for Facebook, and Jonathan Zittrain, professor at Harvard Law School and the Harvard School of Engineering and Applied Sciences, wryly summarized it in a tweet: Facebook basically locked its keys in the car.

Copyright © 2021 IDG Communications, Inc.
