Facebook, as most of you no doubt know, was down for several hours last week, and it drove a lot of people up the wall. That is certainly not the best thing for the social network’s brand and reputation. Facebook Software Engineering Director Robert Johnson recently tried to explain what his company said was the “worst outage we’ve had in over four years.”
“Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second,” Johnson explained. “To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued.”
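To make the feedback loop Johnson describes a little more concrete, here is a toy simulation in Python. It is only a sketch under our own assumptions, not Facebook's code: the names (`Database`, `run_tick`, `simulate`), the client count, and the database capacity are all made up for illustration, and "concurrency" is crudely modeled by having every client read the cache at the same instant in each tick.

```python
from dataclasses import dataclass

KEY = "config"  # hypothetical cache key, for illustration only


class DatabaseOverloaded(Exception):
    """Raised when the toy database gets more queries than it can serve."""


@dataclass
class Database:
    value: str      # the persistent copy of the configuration value
    capacity: int   # queries the database can answer per tick (assumed)
    served: int = 0

    def start_tick(self) -> None:
        self.served = 0

    def query(self) -> str:
        self.served += 1
        if self.served > self.capacity:
            raise DatabaseOverloaded()
        return self.value


def is_valid(value) -> bool:
    return value is not None and value != "INVALID"


def run_tick(cache: dict, db: Database, clients: int, errors_invalidate: bool) -> int:
    """Run one round of client activity and return the number of DB queries."""
    db.start_tick()
    # Every client reads the cache at the same moment.
    if is_valid(cache.get(KEY)):
        return 0
    queries = 0
    for _ in range(clients):
        queries += 1
        try:
            fresh = db.query()
        except DatabaseOverloaded:
            # The behavior described in the quote: an error is treated like
            # an invalid value, so the cache key is deleted.
            if errors_invalidate:
                cache.pop(KEY, None)
            continue
        if is_valid(fresh):
            cache[KEY] = fresh
        else:
            cache.pop(KEY, None)
    return queries


def simulate(errors_invalidate: bool, clients: int = 1000,
             capacity: int = 100, ticks: int = 5) -> list:
    cache = {KEY: "INVALID"}
    db = Database(value="INVALID", capacity=capacity)
    history = []
    for tick in range(ticks):
        if tick == 1:
            db.value = "good"  # the original problem is fixed after one tick
        history.append(run_tick(cache, db, clients, errors_invalidate))
    return history


print("errors delete the key:", simulate(errors_invalidate=True))
print("errors leave the key: ", simulate(errors_invalidate=False))
```

In this toy model, when errors delete the cache key, the query count stays at the full client count on every tick even after the stored value is corrected, because the failed queries keep wiping out the good value that the successful ones wrote. When errors leave the cache alone, the storm stops one tick after the fix.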
Serving more than 500 million people on one network is extremely challenging, we’re sure, but keeping them happy is critical to preserving a positive brand association.
Source: Johnson’s FB blog