Facebook: You're Doing It Wrong

Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we have had in over four years, and we wanted first of all to apologize for it. We also wanted to provide more technical detail on what happened and share one big lesson learned.

What's Wrong With Facebook

The key flaw that made this outage so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it does not work when the persistent store itself is invalid.
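The check-and-repair behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual system: `CountingStore`, `get_config`, and `is_positive_int` are hypothetical names, and the validity rule is an assumption chosen only to make the example concrete.

```python
from typing import Callable, Dict, Optional


class CountingStore:
    """Hypothetical stand-in for the persistent store; counts the queries it serves."""

    def __init__(self, data: Dict[str, str]) -> None:
        self.data = data
        self.queries = 0

    def load(self, key: str) -> Optional[str]:
        self.queries += 1
        return self.data.get(key)


def get_config(
    key: str,
    cache: Dict[str, Optional[str]],
    store: CountingStore,
    is_valid: Callable[[Optional[str]], bool],
) -> Optional[str]:
    """Return the cached value, reloading it from the store if it looks invalid."""
    value = cache.get(key)
    if is_valid(value):
        return value  # fast path: no query to the store
    # Cached value is missing or invalid: reload it from the persistent store.
    fresh = store.load(key)
    cache[key] = fresh
    return fresh


def is_positive_int(value: Optional[str]) -> bool:
    """Assumed validity rule for this example: a positive integer string."""
    return value is not None and value.isdigit() and int(value) > 0
```

With a valid value in the store, a bad cache entry costs one query and is then repaired. If the stored value itself fails validation, the repair never sticks, so every read turns into a store query.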

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that did not allow the databases to recover.
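The difference between treating a failed query as an invalid value and treating it as a separate condition can be shown in a few lines. This is an illustrative sketch only: `StoreUnavailable`, `refresh_naive`, and `refresh_keep_stale` are hypothetical names, and the second function is one possible alternative, not a description of the fix that was deployed.

```python
from typing import Callable, Dict


class StoreUnavailable(Exception):
    """Hypothetical error raised when the database cannot serve a query."""


def refresh_naive(key: str, cache: Dict[str, str], load: Callable[[str], str]) -> None:
    """Flawed handling: a failed query is treated like an invalid value."""
    try:
        cache[key] = load(key)
    except StoreUnavailable:
        # Deleting the key means the next read misses and queries the database again.
        cache.pop(key, None)


def refresh_keep_stale(key: str, cache: Dict[str, str], load: Callable[[str], str]) -> None:
    """Alternative: on a failed query, leave the cache entry untouched."""
    try:
        cache[key] = load(key)
    except StoreUnavailable:
        # Keep whatever is cached so an overloaded database does not attract more load.
        pass
```

Under the naive version, an overloaded database causes good cache entries to be deleted, which produces more misses and more queries. Keeping the existing entry breaks that cycle, at the cost of possibly serving a stale value for a while.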

The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we have turned off the system that attempts to correct configuration values. We are exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
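One widely used pattern for damping this kind of feedback loop is a circuit breaker, which stops sending queries to a backend after repeated failures and retries only after a cooldown. The sketch below is an assumption about what such a pattern might look like in general; the post does not say which designs are being considered, and the class name, thresholds, and injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calls to a failing backend until a cooldown period has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        """Return True if a query may be sent to the backend right now."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, requests are allowed again;
        # a further failure re-opens the breaker and restarts the cooldown.
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        """Reset the breaker after a successful query."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failed query and open the breaker at the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A client would call `allow_request()` before querying the database and fall back to the cached value when it returns False, so a struggling cluster sees less traffic rather than more.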

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.