Facebook: Sorry, Something Went Wrong

Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we've had in over four years, and we wanted to first of all apologize for it. We also wanted to provide more technical detail on what happened and share one big lesson learned.

What's Wrong With Facebook

The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn't work when the persistent store itself is invalid.
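A minimal sketch of that repair behavior, assuming a dictionary-like cache and a store object with a `read` method. The names `get_config`, `CountingStore`, and `is_valid` are hypothetical and chosen for illustration; this is not the actual Facebook implementation.

```python
class CountingStore:
    """Hypothetical stand-in for the persistent store; counts reads."""

    def __init__(self, data):
        self.data = dict(data)
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.data[key]


def get_config(key, cache, store, is_valid):
    """Return a config value, repairing the cache from the store if needed."""
    value = cache.get(key)
    if value is not None and is_valid(value):
        return value  # cached value is fine, no store query needed
    # Cached value is missing or invalid: refresh it from the persistent store.
    fresh = store.read(key)
    cache[key] = fresh
    return fresh


# Transient cache problem: one store read repairs it.
good_store = CountingStore({"limit": "10"})
cache = {"limit": "oops"}
get_config("limit", cache, good_store, str.isdigit)
get_config("limit", cache, good_store, str.isdigit)
print(good_store.reads)

# Invalid persistent copy: every request goes to the store.
bad_store = CountingStore({"limit": "oops"})
bad_cache = {}
for _ in range(5):
    get_config("limit", bad_cache, bad_store, str.isdigit)
print(bad_store.reads)
```

In the second case the refreshed value never passes validation, so the cache never absorbs the load and each request reaches the store.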

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.
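The difference between the flawed error handling and a more defensive alternative can be sketched as follows. Everything here is an assumption for illustration: `StoreUnavailable`, `repair_flawed`, and `repair_safer` are hypothetical names, and the "safer" variant is one possible approach, not the fix Facebook adopted.

```python
class StoreUnavailable(Exception):
    """Hypothetical error raised when the database cannot answer."""


def repair_flawed(key, cache, query_store, is_valid):
    """Conflates a query error with an invalid value and drops the cache key."""
    try:
        value = query_store(key)
    except StoreUnavailable:
        value = None  # the error is treated like an invalid value
    if value is None or not is_valid(value):
        cache.pop(key, None)  # the next request must query the store again
        return None
    cache[key] = value
    return value


def repair_safer(key, cache, query_store, is_valid):
    """Keeps the cached entry when the store fails or returns bad data."""
    try:
        value = query_store(key)
    except StoreUnavailable:
        return cache.get(key)  # do not delete; avoid generating more queries
    if not is_valid(value):
        return cache.get(key)
    cache[key] = value
    return value


def overloaded(key):
    # Simulates a database that is too busy to respond.
    raise StoreUnavailable(key)


cache_a = {"limit": "10"}
repair_flawed("limit", cache_a, overloaded, str.isdigit)
print(cache_a)  # the key has been deleted

cache_b = {"limit": "10"}
repair_safer("limit", cache_b, overloaded, str.isdigit)
print(cache_b)  # the key is still present
```

With the flawed variant, each failed query removes a cache entry and so guarantees another query, which is the self-sustaining loop described above.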

The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
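The post does not say which design patterns are under consideration. One commonly used pattern for breaking this kind of feedback loop is a circuit breaker, which stops sending queries to a failing backend for a cooling-off period. The sketch below is an illustrative assumption only; `CircuitBreaker` and its parameters are hypothetical names, not part of any Facebook system.

```python
import time


class CircuitBreaker:
    """Stops calls to a backend after repeated failures (illustrative only)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial request
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # time the breaker opened, or None if closed

    def allow(self):
        """Return True if a request may be sent to the backend."""
        if self.opened_at is None:
            return True
        # After the cooling-off period, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


# Usage with a fake clock so the behavior is easy to follow.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.allow())  # breaker is open, requests are held back
now[0] = 10.0
print(breaker.allow())  # cooling-off period elapsed, a trial request is allowed
```

A client would call `allow()` before querying the database and skip the query while the breaker is open, so a struggling cluster sees less traffic instead of more.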

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.