What really caused the 2021 Facebook outage (summary)

Hussein Nasser
A summary of the outage that happened to the Facebook network on October 4th, 2021.


On October 4th, 2021, Facebook, Instagram and WhatsApp disappeared from the Internet for about 6 hours. It was one of the major Internet outages of 2021.

In this video, I'll summarize the cause of the outage, explain why it took so long to restore, and finally read through both Facebook's and Cloudflare's articles in detail and give my thoughts. Let's get into it.


The Internet can be thought of as a network of networks, connected and advertised to each other by a protocol called BGP (Border Gateway Protocol).

This protocol defines the routes for getting from one network to another, and it allows for multiple path options, redundancy, and efficient shortest-path routing.

Networks constantly update their routes when something changes, proposing better routes or more redundant paths.
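
To make the idea of competing routes concrete, here is a minimal Python sketch of picking the best advertisement per network. It is a toy model, not how a real BGP router is implemented: it only compares AS-path length, while real BGP applies several other tie-breakers before and after that one. The `Route` class, the `best_routes` function, the prefixes, and the AS numbers are all made up for illustration (the values come from ranges reserved for documentation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str     # destination network (hypothetical documentation prefix)
    as_path: tuple  # networks (ASes) the advertisement passed through


def best_routes(advertisements):
    """Keep, for each prefix, the advertisement with the shortest AS path."""
    best = {}
    for route in advertisements:
        current = best.get(route.prefix)
        if current is None or len(route.as_path) < len(current.as_path):
            best[route.prefix] = route
    return best


# Two neighbors advertise the same network over paths of different lengths.
ads = [
    Route("203.0.113.0/24", (64500, 64501, 64510)),
    Route("203.0.113.0/24", (64502, 64510)),
    Route("198.51.100.0/24", (64500, 64511)),
]
table = best_routes(ads)
print(table["203.0.113.0/24"].as_path)  # (64502, 64510)
```

The longer advertisement is not useless: if the shorter one is withdrawn, it is the fallback, which is where the redundancy comes from.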

What happened?

Now that we know how the Internet routes traffic, what exactly happened that caused Facebook to go dark?

Around 8:50 AM Pacific time, the Facebook network stopped advertising its presence to the other routers on the Internet. This was due to a configuration change Facebook made to its backbone routers.

This means any IP address that belonged to the Facebook network could no longer be routed to, since there were no paths available anymore.
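
Here is a rough Python sketch of what a withdrawn route looks like from another router's point of view. It assumes a simplified routing table with no default route, and the `RoutingTable` class, the prefixes, and the `peer-A` / `peer-B` labels are hypothetical; they are not Facebook's real prefixes or peers.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: announced prefixes mapped to a next-hop label."""

    def __init__(self):
        self._routes = {}

    def announce(self, prefix, next_hop):
        self._routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the next hop of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None  # no path: the packet cannot be forwarded
        most_specific = max(matches, key=lambda net: net.prefixlen)
        return self._routes[most_specific]


table = RoutingTable()
table.announce("192.0.2.0/24", "peer-A")
table.announce("198.51.100.0/24", "peer-B")

before = table.lookup("192.0.2.53")  # routable while the prefix is announced
table.withdraw("192.0.2.0/24")       # the network stops advertising itself
after = table.lookup("192.0.2.53")   # same address, nowhere to send it

print(before, after)  # peer-A None
```

Nothing about the server at that address changed between the two lookups. Only the advertisement went away, and that alone is enough to make it unreachable.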

The problem was first noticed with DNS: requests for facebook.com were timing out. DNS servers have IP addresses just like any other server, and Facebook's authoritative DNS servers are located on the Facebook network. Since those servers couldn't be reached, DNS resolvers couldn't route requests to them, and the requests timed out.

Even if you did have an IP address for facebook.com cached, you couldn't route to it because the paths had disappeared.
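
The timeout can be reproduced locally with a small Python sketch. It builds a minimal DNS query by hand and sends it over UDP to a server that never answers. The silent server here is just a local socket standing in for an unreachable authoritative nameserver, and `build_query` and `query_nameserver` are names I made up for this example; a real resolver would also retry and try other nameservers before giving up.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Encode a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (0 = plain query), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A record), QCLASS = 1 (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def query_nameserver(server, name, timeout=0.5):
    """Send one UDP query; return the raw reply, or None if nothing answers."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), server)
        try:
            reply, _ = sock.recvfrom(512)
            return reply
        except socket.timeout:
            return None


# Stand-in for an unreachable nameserver: a local socket that never replies.
silent = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
silent.bind(("127.0.0.1", 0))
result = query_nameserver(silent.getsockname(), "facebook.com")
silent.close()

print("timed out" if result is None else "got a reply")  # timed out
```

From the resolver's side there is no error message to read. The query leaves, nothing comes back, and all it can do is wait for the timer to run out.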

Why did it take so long?

My guess is that the internal tools Facebook used to make the configuration update were themselves locked behind the Facebook network and couldn't be accessed, mainly because everyone is working remotely. This might have made the job of reverting the configuration change harder. It may also be why we didn't get outage updates.

Resources
https://engineering.fb.com/2021/10/04...
https://blog.cloudflare.com/october-2...
https://blog.cloudflare.com/cloudflar...
The Cloudflare Outage - What Happened...



Stay Awesome,
Hussein
