DNS big data analytics
As the operator of the .nl ccTLD, SIDN is very interested in keeping the .nl zone as safe as possible. Analyzing the query data can help to detect cybercrime activity in the .nl zone, which we can then try to clean up. Traditional DNS query data analysis, done by storing data as PCAPs and analyzing them with tools such as tshark and Wireshark, is often a slow and painful process. When you store DNS query data as PCAP files, you will quickly run into performance and scalability problems. Most tooling used to analyze PCAPs is single-threaded and has limited or no SQL compatibility. What is required is a system that can cope with large volumes of PCAP data and still offer good query performance.
That's why SIDN developed a DNS big data platform called ENTRADA. This platform is built on top of the Hadoop stack using open source technology. DNS query data from our authoritative name servers is stored on this platform and can be analyzed using multiple interfaces and languages. The system supports SQL, which means that anyone with SQL knowledge can quickly start analyzing the query data. Currently the database contains over 64 billion DNS queries. Each day some 130 million new queries are added, and this number will grow as we hook up more name servers.
In this presentation I will be talking about system design, use cases and our experiences.
The platform at SIDN is used by the R&D team and is quite small (5 nodes). The costs of setting up such a cluster are very modest; the main components are, as expected, hardware and people. The hardware does not have to be enterprise grade and much of the required knowledge is available for free online.
Adding more storage and compute capacity is as simple as adding more disk drives or servers. The cluster storage capacity at the moment is about 100 billion DNS queries and this data can be queried very efficiently. Depending on the type of query and number of data partitions that have to be scanned, most queries will return a result within seconds.
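As a rough back-of-the-envelope check, the figures quoted above (64 billion stored queries, 130 million new queries per day, capacity of about 100 billion queries) can be combined to estimate the remaining headroom. The sketch below assumes a constant daily rate, which the text itself says will not hold once more name servers are connected, so the result is an upper bound rather than a forecast.

```python
# Figures taken from the text; everything else is simple arithmetic.
stored_queries = 64_000_000_000      # queries currently in the database
capacity_queries = 100_000_000_000   # approximate cluster storage capacity
daily_growth = 130_000_000           # new queries added per day

# Assumption: the daily rate stays constant (it is expected to grow).
days_until_full = (capacity_queries - stored_queries) / daily_growth
print(f"about {days_until_full:.0f} days of headroom at the current rate")
```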
Privacy is an important aspect when collecting DNS data, because the query data might reveal personal information about the users who are sending DNS queries. The IP address of a client can in some cases be used to identify and track users (for example, a home user operating a private resolver, or the users of a small shared resolver). We designed a privacy framework that is novel (1) because it introduces privacy management to the use of DNS data and (2) because, to that end, it integrates legal, organizational and technical aspects of privacy management.
This is described in our paper: https://www.sidnlabs.nl/uploads/tx_sidnpublications/SIDN_Labs_Privacyraamwerk_Position_Paper_V1.4_ENG.pdf
The time it takes from a query being received on the name server until it is available in the database for analysis is just a couple of minutes. The steps involved are:
- Get PCAP data from the name servers every x minutes
- PCAP conversion
- Enrichment of data
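The conversion and enrichment steps above can be sketched in a few lines of Python. This is a minimal illustration, not the ENTRADA implementation: it works on a bare DNS message rather than a real PCAP file (so no link, IP or UDP headers are stripped), it assumes the question name is not compressed, and the enrichment fields (`domainname`, `country`) as well as the prefix-to-country lookup table are hypothetical.

```python
import struct

def build_query(qname: str, qtype: int = 1, qid: int = 0x1234) -> bytes:
    """Build a minimal DNS query message (header plus one question)."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)

def parse_query(payload: bytes) -> dict:
    """Conversion step: decode the header and the first question."""
    qid, _flags, _qd, _an, _ns, _ar = struct.unpack("!HHHHHH", payload[:12])
    pos, labels = 12, []
    while payload[pos] != 0:              # a zero length byte ends the name
        length = payload[pos]
        labels.append(payload[pos + 1 : pos + 1 + length].decode("ascii"))
        pos += 1 + length
    qtype, _qclass = struct.unpack("!HH", payload[pos + 1 : pos + 5])
    return {"id": qid, "qname": ".".join(labels).lower(), "qtype": qtype}

def enrich(row: dict, src_ip: str, country_by_prefix: dict) -> dict:
    """Enrichment step: add derived columns (hypothetical field names)."""
    enriched = dict(row)
    enriched["src"] = src_ip
    # Registered domain approximated as the last two labels.
    enriched["domainname"] = ".".join(row["qname"].split(".")[-2:])
    prefix = src_ip.rsplit(".", 1)[0]     # toy lookup key, IPv4 only
    enriched["country"] = country_by_prefix.get(prefix, "unknown")
    return enriched

payload = build_query("www.example.nl")
row = enrich(parse_query(payload), "192.0.2.10", {"192.0.2": "NL"})
print(row)
```

In a real pipeline the enriched rows would then be written to the storage layer; here they are only printed.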
There are a lot of different storage technologies. We chose to use the Parquet format to encode the data and Hadoop HDFS as the distributed storage layer. This part explains why Parquet is such a good fit for storing DNS data.
- Why we chose Parquet
- Size difference (PCAP vs. Parquet, total database size)
- How do you convert PCAP data to Parquet? (Write Parquet with an Avro schema, using the Kite SDK.)
- The Parquet format can be read by Impala but also by Spark, which makes it very flexible.
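One reason a columnar format can suit DNS data is that many columns (query type, country, domain name) repeat a small set of values. Parquet stores data column by column and can apply dictionary encoding to such columns. The toy sketch below illustrates that idea in plain Python; it is not the Parquet file format or the Kite SDK, and the column names and sample rows are made up.

```python
def dictionary_encode(column):
    """Replace each value with an index into a list of distinct values."""
    dictionary, index_of, indices = [], {}, []
    for value in column:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices

# Hypothetical query rows, as a row-oriented store would hold them.
rows = [
    {"qname": "example.nl", "qtype": "A", "country": "NL"},
    {"qname": "example.nl", "qtype": "AAAA", "country": "NL"},
    {"qname": "sidn.nl", "qtype": "A", "country": "US"},
    {"qname": "example.nl", "qtype": "A", "country": "NL"},
]

# Columnar layout: one list of values per column instead of one dict per row.
columns = {name: [row[name] for row in rows] for name in rows[0]}
encoded = {name: dictionary_encode(values) for name, values in columns.items()}

for name, (dictionary, indices) in encoded.items():
    print(name, dictionary, indices)
```

Each repeated string is stored once in the dictionary, and the column itself shrinks to a list of small integers, which in turn compresses well.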
Query engines and interfaces
The data stored in the system can be accessed through multiple query engines and interfaces. They support workloads ranging from a simple SQL query to advanced graph and machine learning jobs.
- Impala/Impyla (SQL engine)
- Spark (SQL/Graph/Machine learning engine)
- Hue (SQL web interface)
- Jupyter (Python notebook)
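To give an idea of the kind of SQL analysis this enables, the sketch below counts queries and distinct source addresses per domain name. It uses Python's built-in sqlite3 module purely as a stand-in so that the example runs anywhere; on the platform itself a query like this would go through one of the engines listed above. The table name, column names and sample rows are hypothetical.

```python
import sqlite3

# In-memory stand-in database with a hypothetical "queries" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queries (qname TEXT, qtype INTEGER, src TEXT)")
conn.executemany(
    "INSERT INTO queries VALUES (?, ?, ?)",
    [
        ("example.nl", 1, "192.0.2.10"),
        ("example.nl", 28, "192.0.2.11"),
        ("example.nl", 1, "192.0.2.10"),
        ("sidn.nl", 1, "192.0.2.12"),
    ],
)

# Most queried names, with the number of distinct sources per name.
top_domains = conn.execute(
    """
    SELECT qname, COUNT(*) AS total, COUNT(DISTINCT src) AS sources
    FROM queries
    GROUP BY qname
    ORDER BY total DESC, qname
    """
).fetchall()
print(top_domains)
conn.close()
```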
Our use cases are focused on increasing the security and stability of .nl:
- DNS security App (visualize traffic patterns for phishing domain names)
- Botnet detector (detect botnet infections and report these to abuse information exchange (https://www.abuseinformationexchange.nl/english))
- Real-time Phishing domain name detection
- Statistics dashboard (stats.sidnlabs.nl)
- Scientific research (collaboration with Dutch Universities)
- Ad-hoc operational analysis (quick analysis of current issues in the DNS)
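As an illustration of what a traffic-pattern analysis could look like, the sketch below flags domain names whose latest daily query count is far above their own recent history. It is a toy heuristic and not a description of SIDN's detectors; the thresholds (`factor`, `min_queries`), the domain names and the counts are all invented for the example.

```python
from statistics import median

def spiking_domains(daily_counts, factor=10, min_queries=100):
    """Return domains whose latest daily count far exceeds their history.

    daily_counts maps a domain name to a non-empty list of daily query
    counts, oldest first; the last entry is the day being examined.
    """
    flagged = []
    for domain, counts in daily_counts.items():
        *history, latest = counts
        baseline = median(history) if history else 0
        # Require both an absolute volume and a large jump over the baseline.
        if latest >= min_queries and latest > factor * max(baseline, 1):
            flagged.append(domain)
    return sorted(flagged)

counts = {
    "example.nl": [900, 1100, 1000, 1050],       # busy but stable
    "new-login-example.nl": [3, 5, 4, 800],      # sudden spike
    "quiet-example.nl": [2, 1, 3, 40],           # jump, but low volume
}
flagged = spiking_domains(counts)
print(flagged)
```

A flagged name is only a candidate for closer inspection; a legitimate site can spike for many reasons.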
Our experiences in working with this data: there is so much work to be done once this data is available that we hired an additional data scientist.
Future work:
- Combine passive data from the .nl authoritative name servers with active scans of the complete .nl zone and ISP data.
- Add more name servers and resolvers.
- Open data interface
- We believe that our choice of technology combined with our privacy framework is quite novel.
- Our setup proves that a big data platform can start small with limited costs and still be powerful.
- We provide the rationale behind our architectural decisions with regard to tools, workflow and data formats for storage.
- We provide example use cases of what is possible when this data is available for analysis.
As the operator of the .nl ccTLD, SIDN is very interested in keeping the .nl zone as safe as possible. For this goal a DNS big data platform (ENTRADA) has been developed by SIDN. This talk will provide an overview of the ENTRADA DNS big data platform design, technology and use cases.
This talk will provide insight into the following aspects of DNS big data:
- Big data does not have to mean big costs. It's possible to build a DNS big data cluster on a low budget and make it grow.
- What does such a platform look like? What are the architectural decisions? (tools, workflow, data formats for storage) How do you address privacy concerns?
- What are the use cases for DNS big data? We have developed some applications and are doing research together with universities.
The focus will be on: a) the major system components, b) the technology used to build the platform, and c) the possible use cases for DNS data analytics, such as botnet detection and anti-phishing.
keywords: DNS, Hadoop, Apache Parquet, Cloudera Impala/Impyla, Jupyter, Apache Spark.
Please also consider this submission for the NANOG65 DNS track