October 31, 2019 to November 1, 2019 AGM
JW Marriott Austin
America/Winnipeg timezone
OARC31 Presentation Videos available at https://youtube.com/DNS-OARC

Riding the DNS Camel on the Way to Deploying DNSSEC in a Large Enterprise

Oct 31, 2019, 11:00 AM
Griffin Hall (JW Marriott Austin)

110 E 2nd St Austin TX 78701 USA
No longer available: Standard Presentation Public Workshop


Han Zhang (Salesforce)


Proposed Speakers: Allison Mankin and Han Zhang
amankin@salesforce.com, hzhang@salesforce.com


A large enterprise that places a high value on customer trust may decide to deploy DNSSEC on its domains. However, thanks to the camel that is DNS [0], there are many obstacles on the way there, including obstacles not specific to DNSSEC and obstacles that deploying DNSSEC makes worse. This is especially the case if that enterprise has large zones with frequently changing records. In this presentation, we share some experiences and lessons from deploying DNSSEC on our very large live production zones, hoping these help other organizations when they take this road.

Challenges during Preparation:
Our preparation included comparing services and features of multiple managed DNS providers (a topic we presented in [1]). We also conducted DNSSEC functional and performance testing against those providers, simulating workloads of our production zones. We have a requirement to use multiple providers for resilience, and our preparation led us to the insight that multi-provider DNSSEC models are needed [2].

Because our enterprise constantly creates, modifies and deletes domains and records, using DNS in fairly expansive ways, one of the greatest challenges during preparation was to find vendors able to handle monster and mega-monster-sized dynamic zones, where monster refers to O(1M), and mega-monster to O(10M). These large zones also see tens to hundreds of thousands of changes per day. We learned how vendors manage signed versions of zones with these properties. We will share some information about tradeoffs.

Throughout this talk, we will not identify vendors by name.

Challenges during Deployment:
Because of the feature gaps, our DNSSEC deployment included migrating live zones to new vendors, which had to be done without causing any downtime for the customers and internal applications that depend on this DNS data. We learned that the move phase and the DNSSEC-enabling phase needed to be well separated.

Specifically, the challenges of seamlessly migrating a zone include:
i) ensuring that applications make no assumptions that will be impacted by a vendor switch (and being ready to roll back test migrations and wait for fixes);
ii) migrating a zone so that it is provisioned active-active, while being prepared to handle some inconsistencies: vendors vary in how they bootstrap a zone, so there will be a period in which the new zone does not receive updates;
iii) being prepared to handle issues around delegation and sub-zones. For example, if a sub-zone is delegated to a vendor but its parent zone is not, and the parent zone is being updated there in preparation for a move, the vendor may answer for the parent zone even though the NS records for that parent zone have not changed. We will present both versions of this and their implications for the DS-publishing stage of deployment;
iv) being prepared for surprises by being able to roll back quickly;
v) in contrast to that, being aware of the impact of the 48 hour COM TTL, which we will report on (it was not as high impact as feared).
Besides the zone migration, we will also talk about other challenges, such as digging out the non-standard and dynamic records (such as GSLB).

Avoiding Hazards after Deployment:
Our testing made us aware that very large signed zones could create enormous surges of re-signing traffic, which would not necessarily be handled well by XFR. This isn’t a new observation, but it had a large impact on us, and we worked with stack vendors to increase the skew of re-signing to avoid a crisis further down the road. This is one example, and the talk may present some other hazards where the camel and the trip could be blocked.
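
The effect of skewing re-signing can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a description of any vendor's signer: it assumes the skew behaves like a random jitter subtracted from each signature's validity period, that every RRset is signed at hour 0 (as when a zone is first signed), and it uses hypothetical numbers (100,000 RRsets, 14-day validity, up to 167 hours of jitter). The function names are illustrative.

```python
import random
from collections import Counter

def resign_due_hours(num_rrsets, validity_hours, jitter_hours, rng):
    """Return, for each RRset, the hour at which its signature comes due.

    Assumption: every RRset is signed at hour 0, and the signer shortens
    each signature's validity by a random amount between 0 and
    jitter_hours (inclusive).
    """
    return [validity_hours - rng.randint(0, jitter_hours)
            for _ in range(num_rrsets)]

def peak_per_hour(due_hours):
    """Largest number of RRsets that come due within any single hour."""
    return max(Counter(due_hours).values())

rng = random.Random(0)  # fixed seed so the sketch is repeatable
no_jitter = resign_due_hours(100_000, 336, 0, rng)
with_jitter = resign_due_hours(100_000, 336, 167, rng)
print("peak without jitter:", peak_per_hour(no_jitter))
print("peak with jitter:   ", peak_per_hour(with_jitter))
```

Without jitter, all 100,000 signatures come due in the same hour. With jitter, the same work is spread over 168 hourly buckets, so the peak is a small fraction of the zone, and the resulting zone transfers are spread out accordingly.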

We are monitoring our DNSSEC zones for multiple hazards going forward. Because DNSSEC introduces more data for XFRs, the first thing we monitor is whether zones are synced among the multiple name servers (in the multiple providers) and whether significant propagation lags have been introduced. Zone propagation lag has improved for us because of the extensive review and the migrations we carried out in preparation for DNSSEC. Another thing we monitor is DNSSEC correctness of the zones, which we have done by using dnsviz [3] to produce data for a dashboard and trigger pages when any errors are found. The DNSSEC change is not complete, so we will not cover customer experience testing and responses.
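
One way such a sync check could work is to compare the SOA serial each name server reports. The sketch below is an assumption about the approach, not the monitoring described in the talk: it assumes all servers serve the same serial sequence (as in XFR-based setups), that the serials have already been collected by SOA queries elsewhere, and it uses hypothetical host names. Serial comparison follows RFC 1982 serial number arithmetic so that 32-bit wraparound is handled.

```python
MOD = 2 ** 32    # SOA serials are 32-bit unsigned integers
HALF = 2 ** 31

def serial_gt(a, b):
    """True if serial a is newer than serial b under RFC 1982 arithmetic."""
    return a != b and (a - b) % MOD < HALF

def serial_lag(newest, other):
    """Number of increments by which `other` trails `newest`."""
    return (newest - other) % MOD

def find_lagging(serials, max_lag=0):
    """Map each server trailing the newest serial by more than max_lag
    to its lag. `serials` maps server name -> observed SOA serial."""
    newest = None
    for serial in serials.values():
        if newest is None or serial_gt(serial, newest):
            newest = serial
    return {server: serial_lag(newest, serial)
            for server, serial in serials.items()
            if serial_lag(newest, serial) > max_lag}

# Hypothetical observations from two providers.
observed = {
    "ns1.provider-a.example": 2019103105,
    "ns2.provider-a.example": 2019103105,
    "ns1.provider-b.example": 2019103102,
}
print(find_lagging(observed))
```

A monitor built this way would page only when the returned mapping stays non-empty for longer than an acceptable propagation window, since a brief lag right after an update is normal.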

Overall, the road to DNSSEC had surprises but was worth traveling. This talk offers our experiences as a first pass at best practices for large enterprises. We will conclude with thoughts on the cadence of DNSSEC changes.

[1] DNSSEC for a Complex Enterprise Network https://indico.dns-oarc.net/event/28/contributions/523/
[2] Multi-Signer DNSSEC Models https://indico.dns-oarc.net/event/31/contributions/683/ (https://indico.dns-oarc.net/event/31/contributions/683/attachments/667/1096/multi-signer.pdf)
[3] dnsviz http://dnsviz.net/

Talk Duration: Lightning Talk, 10 Minutes

Primary author

Han Zhang (Salesforce)
