Many DNS operators, particularly those running high-volume authoritative servers (such as TLD operators), perform operational monitoring of their incoming (and outgoing) DNS query load. Often, this entails capturing (and subsequently storing and analyzing) the query/response stream.
With DNS query rates (and hence traffic) increasing year by year, operators face the challenge that capture, transport and storage of the full message stream exceed their operational capacity, and that operating the capture infrastructure becomes more resource-intensive than operating the actual DNS service. Furthermore, operators of DNS outsourcing solutions (such as anycast operators) experience similar challenges, often orders of magnitude worse than those of operators of a single zone.
An obvious solution to this is to limit traffic capture to a certain subset of the messages received and sent. For example, some operators perform full capture for only a few minutes each hour, but this limits the "visibility" of traffic to a very small fraction of continuous time.
In statistics, the problem of choosing a subset out of a large set of observations is very common and is called "sampling". The most commonly used sampling methods are "random", "systematic", and "stratified", each involving a different algorithm to identify the observations to be selected.
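As a purely illustrative sketch (not taken from the thesis project), the Python fragment below shows how random and systematic 1-in-n selection could be applied to a stream of captured messages; the function names and the rate n are assumptions made for this example.

```python
import random

def systematic_sample(messages, n):
    """Systematic sampling: keep every n-th message from the stream."""
    return [m for i, m in enumerate(messages) if i % n == 0]

def random_sample(messages, n):
    """Random sampling: keep each message independently with probability 1/n."""
    return [m for m in messages if random.random() < 1.0 / n]

# Stratified sampling would first split the stream into groups ("strata"),
# e.g. by transport protocol or query type, and then sample within each stratum.
```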
Together with SIDN, nic.at runs a project with a student to investigate the properties of the individual sampling methodologies with regard to DNS traffic. While the research for this Master's thesis is still in progress, nic.at decided to also kick off a project to implement DNS sampling in practice in a widely available tool.
This talk explores the reasoning behind, the current status of, and the results of adding systematic sampling to "dnscap", a widely used DNS traffic capture utility. The talk also explores how the damage to figures that are badly affected by sampling (such as the total number of unique clients and the total number of unique QNAMEs) can be offset by adding an aggregating data structure right into the capturing tool.
This aggregating data structure is called "HyperLogLog"; it allows estimation of the cardinality of items in a set and does so with constant memory. No matter how many elements are inserted into the structure, the amount of memory required stays constant, e.g. at about 12 kB in the presented prototype implementation.
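To make the idea more concrete, here is a minimal, illustrative HyperLogLog sketch in Python; it is not the dnscap prototype presented in the talk, and the parameter choice (p = 14, i.e. 16384 registers) is an assumption that, with registers packed at a few bits each, lands in the ballpark of the roughly 12 kB mentioned above.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch; bias corrections other than the
    small-range (linear counting) case are omitted for brevity."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                  # number of registers (16384 for p=14)
        self.registers = [0] * self.m

    def _hash(self, item):
        # 64-bit hash of the item (e.g. a client IP address or a QNAME)
        return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

    def add(self, item):
        x = self._hash(item)
        idx = x >> (64 - self.p)                    # first p bits select a register
        w = x & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        rank = (64 - self.p) - w.bit_length() + 1   # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:           # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw
```

The typical relative error of such an estimate is about 1.04/sqrt(m), i.e. below one percent for 16384 registers.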
The obvious practical application of a "HyperLogLog" structure in DNS is therefore to provide an estimate of the cardinality of client IP addresses and QNAMEs without the need to store and analyze the full data. One of the most amazing aspects of the HyperLogLog structure is that, given two servers A and B, each server can create an individual HyperLogLog structure, and the cardinality of the union of queries from A and B can be calculated simply by creating the union of the HyperLogLog structures from both servers, a property that fits perfectly with the highly parallel structure of DNS installations in anycast networks. So, instead of transferring the whole list of unique names for aggregation, each nameserver would only need to dump its own HyperLogLog structure (a few kB) to a central location, where the final aggregation could take place very efficiently.
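Continuing the sketch above, the union property boils down to taking the element-wise maximum of the two register arrays; the query names below are placeholders, not real data.

```python
def merge(a, b):
    """Union of two HyperLogLog sketches: element-wise maximum of the registers."""
    assert a.p == b.p
    merged = HyperLogLog(a.p)
    merged.registers = [max(x, y) for x, y in zip(a.registers, b.registers)]
    return merged

# Each anycast node builds its own sketch locally from its query stream ...
hll_a, hll_b = HyperLogLog(), HyperLogLog()
for qname in ("example.at.", "nic.at.", "sidn.nl."):
    hll_a.add(qname)
for qname in ("nic.at.", "example.com."):
    hll_b.add(qname)

# ... and only the small register arrays need to travel to the central location,
# where the combined cardinality is estimated (here about 4 unique QNAMEs).
print(round(merge(hll_a, hll_b).estimate()))
```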
A "proof-of-concept" level implementation of “HyperLogLog” in dnscap exists, and is presented during the talk.