29 September 2017 to 3 October 2017
Fairmont San Jose
US/Pacific timezone

DNS Service Monitoring at Salesforce

29 Sept 2017, 15:15
15m
Regency 2 Ballroom (Fairmont San Jose)

170 S Market Street, San Jose, 95113, CA, USA
Standard Presentation Public Workshop

Speaker

Dr Han Zhang (Salesforce)

Description

Highly available and responsive DNS access is critical to earning the trust of enterprise customers for a Software as a Service (SaaS) cloud company such as Salesforce. To provide better DNS services, we have implemented several types of monitoring. They differ from the monitoring done by infrastructure DNS organizations, but we think they provide useful insights about DNS.

Monitoring DNS services benefits us in several ways. First, it gives us real-time and time-series status of our DNS services. For example, we can monitor whether the authoritative name servers always answer DNS queries. Second, we can use the monitoring to benchmark the DNS services of multiple vendors and compare their performance. This helps executives and individual contributors in planning and decision-making. Third, regularly sharing the monitoring reports with vendors can improve their DNS services. Last, monitoring helps uncover hidden problems, for example, undocumented REST API behaviors.

Currently we monitor three aspects of the DNS services provided by our vendors. The first is reachability of the authoritative name servers and of the vendors' REST APIs. The second is consistency across a vendor's name servers, meaning that they all serve the same data. The third is propagation delay: when we change a DNS record, how soon the change becomes visible on the name servers.

We have built dashboards to visualize the monitoring results using two monitoring and alerting platforms developed by Salesforce, ReFocus[1] and Argus[2]. Both tools are open source. ReFocus is a platform for visualizing the health of many services in a manageable way. At Salesforce, ReFocus connects our numerous monitoring sources into a single, self-service platform. Argus is a time-series monitoring system that allows engineering teams to collect, store, annotate, and alert on massive amounts of time-series data, using a scalable, resource-protected architecture. Argus is similar to Graphite[3], but has been refactored to improve scaling behavior for the very diverse requirements of a large cloud organization.

In this presentation, we will first talk about how primary/secondary and active/active configurations are used at Salesforce. Then we will discuss how we monitor name servers' availability, consistency, and propagation delay, and show how we use ReFocus and Argus to visualize the results. We have been running the monitoring and collecting data since the end of 2016. We have analyzed the collected data and will present the results. A first draft of the presentation is included in this submission.

[1] ReFocus: https://engineering.salesforce.com/take-a-moment-to-refocus-86b6546c90c
[2] Argus: https://engineering.salesforce.com/argus-time-series-monitoring-and-alerting-d2941f67864
[3] Graphite: https://graphiteapp.org/
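As an illustration only, the three checks named in the description (reachability, consistency, and propagation delay) could be sketched roughly as below. This is not the Salesforce implementation. The function names `check_servers` and `propagation_delay`, the result layout, and the injected `lookup` callable are all hypothetical. In a real monitor, `lookup` would send a query directly to one authoritative server and return its record set, or `None` if the server did not answer.

```python
import time
from typing import Callable, Dict, FrozenSet, Iterable, List, Optional

# Hypothetical lookup signature: (server, name, rtype) -> record values,
# or None when the server did not answer. The transport is left to the caller.
Lookup = Callable[[str, str, str], Optional[FrozenSet[str]]]


def check_servers(servers: Iterable[str], name: str, rtype: str,
                  lookup: Lookup) -> dict:
    """Reachability and consistency check across a set of name servers."""
    answers = {s: lookup(s, name, rtype) for s in servers}
    # Reachability: servers that returned no answer at all.
    unreachable: List[str] = sorted(s for s, a in answers.items() if a is None)
    # Consistency: every server that answered returned the same record set.
    distinct = {a for a in answers.values() if a is not None}
    return {
        "unreachable": unreachable,
        "consistent": len(distinct) <= 1,
        "answers": answers,
    }


def propagation_delay(servers: Iterable[str], name: str, rtype: str,
                      expected: FrozenSet[str], lookup: Lookup,
                      timeout: float = 60.0, interval: float = 1.0,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep,
                      ) -> Dict[str, Optional[float]]:
    """Poll each server until it serves `expected`; return seconds per server.

    Call this right after submitting a record change. Servers that still do
    not serve the expected data once `timeout` has elapsed map to None.
    """
    start = clock()
    pending = set(servers)
    delays: Dict[str, Optional[float]] = {}
    while pending:
        for server in sorted(pending):
            if lookup(server, name, rtype) == expected:
                delays[server] = clock() - start
        pending.difference_update(delays)
        if not pending or clock() - start >= timeout:
            break
        sleep(interval)
    for server in pending:
        delays[server] = None
    return delays
```

The lookup, clock, and sleep functions are passed in so the logic can be exercised without network access. The measured delay is only as precise as the polling `interval`.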
Talk Duration 15 Minutes

Primary authors

Diana Akrami (University of California, Berkeley) Dr Han Zhang (Salesforce)

Co-authors

Allison Mankin (Salesforce) Tim Wicinski (Salesforce)

Presentation materials