Speaker
Rafael de Elvira
(Senior Software Engineer @ Slack)
Description
On September 30th, 2021, Slack had an outage that impacted less than 1% of our online user base and lasted for 24 hours. The outage was the result of our attempt to enable DNSSEC, which ultimately led to a series of unfortunate events.
In this talk we'll cover our DNSSEC rollout to all of Slack's critical domains and the three failed attempts to enable DNSSEC on slack.com. We'll do a deep dive into our third attempt (the Sept 30th outage), covering what was done during the outage, why we did it, and ultimately the root cause of the outage: a bug in the DNSSEC implementation of our cloud provider's authoritative DNS server.
Primary author
Rafael de Elvira
(Senior Software Engineer @ Slack)