
The stack overflow of death. How we lost DNS and what we're doing to prevent this in the future.


SOURCE: https://bunny.net/blog/the-stack-overflow-of-death-dns-collapse/
Summary

We immediately rolled back all updates for the SmartEdge system, but it was already too late.

Both SmartEdge and the deployment systems we use rely on Edge Storage and Bunny CDN to distribute data to the actual DNS servers. As you can imagine, this essentially prevented the DNS servers from reaching the CDN to download the update, so they stayed stuck in a loop of crashes.

At 8:35 (15:35), a few servers were still struggling to keep up with requests, but to little effect, and we dropped the majority of traffic, down to 100Gbit.

At 8:45 we came up with a plan. With everything now connected to a third-party service, we managed to sync up the databases, deploy the newest file sets and get things back online.

We painstakingly watched traffic pick back up for 30 minutes, while making sure things were back online. We always try to deploy updates in a granular way using the canary technique, but this caught us off guard, since an otherwise non-critical part of the infrastructure turned out to be a critical single point of failure for multiple other clusters at the same time.

Finally, we are making the DNS system itself run a local copy of all backup data with automatic failure detection. This way we can add yet another layer of redundancy and make sure that, no matter what happens, systems within bunny.net remain as independent from each other as possible, preventing a ripple effect when something goes wrong.

I would like to share my thanks to the support team, who worked tirelessly to keep everyone in the loop, and to all of our users for bearing with us while we battled through this.

We understand this has been a very stressful situation, not only for ourselves but especially for all of you who rely on us to stay online, so we are making sure we learn from these events and come out more reliable than ever.
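To make the "local copy with automatic failure detection" idea above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not bunny.net's actual implementation: the function name `load_with_fallback`, the JSON payload, and the file name `routing.json` are all hypothetical. The point is only that a node keeps a last known-good copy on local disk and falls back to it whenever the remote source is unreachable or returns data that cannot be parsed.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_with_fallback(fetch_remote: Callable[[], bytes], local_copy: Path) -> dict:
    """Return parsed data from the remote source, or from the local copy on failure.

    Hypothetical helper: `fetch_remote` stands in for whatever downloads the
    update (for example from a CDN); `local_copy` is the last known-good file.
    """
    try:
        raw = fetch_remote()      # network failures raise OSError subclasses
        data = json.loads(raw)    # a corrupted payload raises ValueError
    except (OSError, ValueError):
        # Failure detected: serve the last known-good data from local disk.
        return json.loads(local_copy.read_bytes())
    # Refresh the backup only after the new data parsed successfully.
    local_copy.write_bytes(raw)
    return data


def broken_fetch() -> bytes:
    # Simulates the remote distribution path being unreachable.
    raise ConnectionError("CDN unreachable")


with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp) / "routing.json"
    first = load_with_fallback(lambda: b'{"version": 1}', backup)   # healthy update
    second = load_with_fallback(broken_fetch, backup)               # remote down
    third = load_with_fallback(lambda: b"{corrupt", backup)         # bad payload
```

In this sketch the backup is overwritten only after a successful parse, so a corrupted update can never replace the copy the node would later need to recover from.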

As said here by Dejan Grofelnik Pelzel