In large-scale DNS deployments, DNS hosting services push zone updates to large numbers of servers on short timescales. It’s normal for the servers to be somewhat out of step with one another because DNS is “eventually consistent” by design. The question for DNS operations is how much lag is OK. For our organization, customer sensitivity to stale information is high, and when lags are anomalous we want to act by asking the hosting services to check for server problems. But which lags are anomalous? What threshold of difference in the SOA serial numbers indicates a problem?
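As a rough illustration of what a "difference in the SOA serial numbers" can mean in practice, the sketch below computes the lag of each probed location relative to a reference serial. It assumes serials are compared with the wraparound rules of RFC 1982 serial number arithmetic. The function names, the location names, and the choice of reference serial are hypothetical and are not taken from the project.

```python
MODULUS = 2 ** 32  # SOA serials are unsigned 32-bit values
HALF = 2 ** 31     # differences beyond this are treated as wraparound


def serial_lag(reference: int, observed: int) -> int:
    """Return how many increments `observed` is behind `reference`.

    The result is negative when `observed` is ahead of `reference`.
    """
    diff = (reference - observed) % MODULUS
    if diff == HALF:
        # RFC 1982 leaves the ordering undefined in this case.
        raise ValueError("serials are exactly 2**31 apart; ordering is undefined")
    return diff if diff < HALF else diff - MODULUS


def lags_by_location(reference: int, observed_by_location: dict) -> dict:
    """Map each (hypothetical) anycast location to its serial lag."""
    return {
        location: serial_lag(reference, serial)
        for location, serial in observed_by_location.items()
    }


# Made-up probe results for two hypothetical locations.
example = lags_by_location(
    2024010105,
    {"site-a": 2024010105, "site-b": 2024010102},
)
```

Here `site-b` is three serial increments behind the reference. Whether a lag of that size is worth acting on is exactly the question the rest of the talk addresses.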
This talk presents a project that uses an unsupervised machine learning algorithm to identify anomalous points in lag data, to help answer the question of which lags are actionable. We used the PyCaret open-source Python machine learning library to train and build a model. We started with PyCaret because it includes many unsupervised ML algorithms, and from those we selected Isolation Forest (iforest), a tree-based algorithm commonly used for anomaly detection because it requires limited memory and is relatively fast.
Additionally, after the model was trained using unsupervised learning, it was fine-tuned with the probe_time of the data points given as its supervised_target. Fine-tuning performs additional training by focusing on update times to improve the model’s performance.
The model was trained on DNS monitor logs that record zone updates made by DNS hosting services, specifically the SOA serial number changes observed at multiple anycast locations for three production zones: my.salesforce.com, salesforce.com, and force.com. The output from the model marks each data point as an anomaly or normal and also provides an anomaly score; the higher the score, the more anomalous the data point. Analyzing the model output reveals trends and patterns in zone updates, the ranges of lags, the dependence of lag on zone size, and so on.
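To give a feel for how an Isolation Forest turns lag values into anomaly scores, here is a minimal pure-Python sketch of the technique for one-dimensional data. It is not PyCaret's implementation and not the project's model. The function names, the parameter defaults, the made-up lag values, and the 0.6 cutoff are all assumptions for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n: int) -> float:
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(values, depth, max_depth, rng):
    """Recursively split the values at random points until they are isolated."""
    if depth >= max_depth or len(values) <= 1 or min(values) == max(values):
        return ("leaf", len(values))
    split = rng.uniform(min(values), max(values))
    left = [v for v in values if v < split]
    right = [v for v in values if v >= split]
    return (
        "node",
        split,
        build_tree(left, depth + 1, max_depth, rng),
        build_tree(right, depth + 1, max_depth, rng),
    )


def path_length(x, tree, depth=0):
    """Depth at which x lands in a leaf, adjusted for points left unsplit there."""
    if tree[0] == "leaf":
        return depth + average_path_length(tree[1])
    _, split, left, right = tree
    return path_length(x, left if x < split else right, depth + 1)


def anomaly_scores(values, n_trees=100, sample_size=64, seed=0):
    """Score each value in (0, 1]; higher means easier to isolate, so more anomalous."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(values))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(values, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    norm = average_path_length(sample_size)
    return [
        2.0 ** (-sum(path_length(v, t) for t in trees) / (n_trees * norm))
        for v in values
    ]


# Made-up lags: many small values plus one large outlier.
lags = [1, 2, 1, 3, 2, 1, 2, 3, 2, 1] * 10 + [500]
scores = anomaly_scores(lags)
cutoff = 0.6  # hypothetical cutoff, chosen only for this illustration
flagged = [lag for lag, score in zip(lags, scores) if score > cutoff]
```

Points that random splits isolate after only a few steps have short average paths and therefore high scores, which is why the large lag stands out from the small ones. A library implementation would typically derive the anomaly/normal label from an expected fraction of outliers, not from a fixed cutoff.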
One result of the project was a set of good numeric thresholds for how much lag is normal and how much is anomalous. The thresholds were extremely different for the three zones (see slide #7). They indicated that lag depends on zone size and on how frequently a zone is updated, which is not surprising, but the actual numbers were surprising. These thresholds are accurate because they mark where truly anomalous behaviors begin. The default thresholds in use before the model were guesstimates by the team; they were the same for all three zones and had not proven predictive.
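One simple way to turn labelled model output into a per-zone threshold is sketched below. The talk does not spell out how the thresholds were derived, so the rule used here (the smallest lag the model labelled anomalous in each zone) is an assumption. The zone names and values are made up, and `thresholds_by_zone` is a hypothetical helper.

```python
from collections import defaultdict


def thresholds_by_zone(records):
    """Return, per zone, the smallest lag that was labelled anomalous.

    `records` is an iterable of (zone, lag, is_anomaly) tuples.
    Zones with no anomalous points are omitted from the result.
    """
    anomalous_lags = defaultdict(list)
    for zone, lag, is_anomaly in records:
        if is_anomaly:
            anomalous_lags[zone].append(lag)
    return {zone: min(lags) for zone, lags in anomalous_lags.items()}


# Made-up labelled points for two placeholder zones.
records = [
    ("zone-a.example", 1, False),
    ("zone-a.example", 2, False),
    ("zone-a.example", 9, True),
    ("zone-a.example", 14, True),
    ("zone-b.example", 40, False),
    ("zone-b.example", 55, False),
    ("zone-b.example", 210, True),
]
thresholds = thresholds_by_zone(records)
```

Computing the threshold separately for each zone reflects the observation above that a single shared default does not fit zones of different sizes and update frequencies.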
The next step of the project is to automate the model prediction by periodically running the model and integrating the new thresholds into the monitoring system (as well as preserving the history).
ML has frequently been applied to DNS data, especially for DNS attacks and security. DNS anomaly detection plays a big operational role in DDoS mitigation, for example. This work instead covers an area of DNS availability, and it seems likely that ML models, both unsupervised and supervised, will prove valuable in many other areas where DNS operators have real-time data.