#Introduction
As the operator of the .nl ccTLD, SIDN is very interested in keeping the .nl zone as safe as possible.
Analyzing the query data can help to detect cybercrime activity in the .nl zone, which we can then try to clean up.
Traditional DNS query data analysis, done by storing data as PCAP files and analyzing them with tools such as tshark and Wireshark, is often a slow and painful process.
When you store DNS query data as PCAP files, you quickly run into performance and scalability problems.
Most tooling used to analyze PCAP files is single-threaded and has limited or no SQL compatibility.
What is required is a system which can cope with large volumes of PCAP data and still offer good query performance.
That is why SIDN developed a DNS big data platform called ENTRADA. The platform is built on top of the Hadoop stack using open source technology.
DNS query data from our authoritative name servers is stored on this platform and can be analyzed using multiple interfaces and languages.
The system supports SQL, which means that anyone with SQL knowledge can quickly start analyzing the query data.
Currently the database contains over 64 billion DNS queries. Each day some 130 million new queries are added, and this number will grow as we hook up more name servers.
In this presentation I will be talking about system design, use cases and our experiences.
#Platform design
The platform at SIDN is used by the R&D team and is quite small (5 nodes).
The costs of setting up such a cluster are very modest. The main components are, as expected, hardware and people.
The hardware does not have to be enterprise grade and much of the required knowledge is available for free online.
Adding more storage and compute capacity is as simple as adding more disk drives or servers.
The cluster storage capacity at the moment is about 100 billion DNS queries and this data can be queried very efficiently.
Depending on the type of query and the number of data partitions that have to be scanned, most queries will return a result within seconds.
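
How many partitions a query has to scan is largely determined by its WHERE clause. As a minimal sketch, assuming a hypothetical table `dns.queries` partitioned by integer columns `year`, `month` and `day` (the actual table and partition layout are not described here), the helper below builds a query restricted to a date range, so that an engine with partition pruning only needs to read the matching partitions:

```python
from datetime import date, timedelta


def partition_filter(start: date, end: date) -> str:
    """Build a WHERE fragment selecting the daily partitions in [start, end].

    Assumes hypothetical integer partition columns year, month and day.
    """
    if end < start:
        raise ValueError("end must not be before start")
    days = []
    current = start
    while current <= end:
        days.append(
            f"(year={current.year} AND month={current.month} AND day={current.day})"
        )
        current += timedelta(days=1)
    return " OR ".join(days)


def count_queries_sql(start: date, end: date) -> str:
    """Return a SQL statement counting queries per day for the given range."""
    return (
        "SELECT year, month, day, COUNT(*) AS total "
        "FROM dns.queries "  # hypothetical table name
        f"WHERE {partition_filter(start, end)} "
        "GROUP BY year, month, day"
    )


if __name__ == "__main__":
    # The dates are arbitrary example values.
    print(count_queries_sql(date(2016, 1, 30), date(2016, 2, 1)))
```

The narrower the date range, the fewer partitions are named in the filter and the less data has to be read.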
#Privacy
Privacy is an important aspect when collecting DNS data, because the query data might reveal personal information about the users who are sending DNS queries.
The IP address of a client can in some cases be used to identify and track users, for example a home user operating a private resolver or the users of a small shared resolver.
We designed a privacy framework that is novel (1) because it introduces privacy management to the use of DNS data
and (2) because, to that end, it integrates the legal, organizational and technical aspects of privacy management.
This is described in our paper: https://www.sidnlabs.nl/uploads/tx_sidnpublications/SIDN_Labs_Privacyraamwerk_Position_Paper_V1.4_ENG.pdf
#Workflow
The time it takes from a query being received on the name server until it is available in the database for analysis is just a couple of minutes.
The steps involved are:
- get PCAP data from the name servers every x minutes
- PCAP conversion
- enrichment of data
- storage
- query!
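
The steps above can be sketched as a chain of small functions. This only illustrates the data flow and is not the platform's implementation: the record fields, the `enrich` step (deriving the registered domain name from the query name) and the in-memory `store` are hypothetical stand-ins, and PCAP decoding is replaced by already-decoded tuples.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Stand-in for decoded PCAP packets: (timestamp, source IP, query name, query type).
RawQuery = Tuple[int, str, str, str]
Record = Dict[str, object]


def convert(packets: Iterable[RawQuery]) -> Iterator[Record]:
    """Conversion step: turn decoded packets into flat records."""
    for ts, src, qname, qtype in packets:
        yield {"time": ts, "src": src, "qname": qname.lower().rstrip("."), "qtype": qtype}


def enrich(records: Iterable[Record]) -> Iterator[Record]:
    """Enrichment step (hypothetical): add the registered domain name."""
    for rec in records:
        labels = str(rec["qname"]).split(".")
        rec["domainname"] = ".".join(labels[-2:])
        yield rec


def store(records: Iterable[Record], table: List[Record]) -> int:
    """Storage step: append records to a table (a list stands in for the database)."""
    before = len(table)
    table.extend(records)
    return len(table) - before


def top_domains(table: List[Record], n: int) -> List[Tuple[str, int]]:
    """Query step: the n most frequently queried domain names."""
    return Counter(str(rec["domainname"]) for rec in table).most_common(n)


if __name__ == "__main__":
    # Made-up example batch using documentation IP addresses.
    batch = [
        (1, "192.0.2.1", "www.Example.nl.", "A"),
        (2, "192.0.2.2", "mail.example.nl.", "MX"),
        (3, "192.0.2.1", "sidn.nl.", "A"),
    ]
    table: List[Record] = []
    store(enrich(convert(batch)), table)
    print(top_domains(table, 2))
```

Because each step consumes and produces an iterable, a batch is processed record by record without holding intermediate copies in memory.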
#Storage
There are many different storage technologies. We chose the Parquet format to encode the data and Hadoop HDFS as the distributed storage layer.
This part explains why Parquet is such a good fit for storing DNS data.
- Why we chose Parquet
- Size difference (PCAP vs. Parquet, total database size)
- How to convert PCAP data to Parquet (writing Parquet with an Avro schema using KiteSDK)
- The Parquet format can be read by Impala but also by Spark, which makes it very flexible.
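
DNS query data is highly repetitive: a handful of query types and a limited set of domain names account for most rows. Columnar formats such as Parquet can exploit this, for example through dictionary encoding. The toy sketch below illustrates that idea in plain Python. It is not the Parquet implementation, it is not necessarily the authors' main reason for choosing Parquet, and it says nothing about the size reduction achieved on the platform.

```python
from typing import Dict, List, Tuple


def dictionary_encode(column: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value by an index into a list of distinct values."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    indices: List[int] = []
    for value in column:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary: List[str], indices: List[int]) -> List[str]:
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


if __name__ == "__main__":
    # A made-up column of query types, as it might appear in a batch of DNS queries.
    qtypes = ["A", "AAAA", "A", "MX", "A", "AAAA", "A"]
    dictionary, indices = dictionary_encode(qtypes)
    print(dictionary)  # distinct values, in order of first appearance
    print(indices)     # one small integer per row
    assert dictionary_decode(dictionary, indices) == qtypes
```

Each distinct string is stored once, and every row only carries a small integer, which in turn compresses well.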
#Query engines and interfaces
The data stored in the system can be accessed through multiple query engines and interfaces.
They support workloads ranging from a simple SQL query to advanced graph and machine learning jobs.
- Impala/Impyla (SQL engine)
- Spark (SQL/graph/machine learning engine)
- Hue (SQL web interface)
- Jupyter (Python notebook)
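
As an illustration of the kind of SQL analysis this enables, the sketch below counts queries per domain name. To stay self-contained it uses Python's built-in `sqlite3` module as a stand-in for the Impala engine, and the `queries` table and its columns are hypothetical. Impyla also exposes a DB-API style cursor, so code against the real platform would look similar, but the connection details are not shown here.

```python
import sqlite3
from typing import List, Tuple


def top_queried_domains(conn: sqlite3.Connection, limit: int) -> List[Tuple[str, int]]:
    """Return the most queried domain names with their query counts."""
    cur = conn.execute(
        "SELECT domainname, COUNT(*) AS total "
        "FROM queries "
        "GROUP BY domainname "
        "ORDER BY total DESC, domainname ASC "
        "LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory table with a few made-up rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE queries (domainname TEXT, qtype TEXT, src TEXT)")
    conn.executemany(
        "INSERT INTO queries VALUES (?, ?, ?)",
        [
            ("example.nl", "A", "192.0.2.1"),
            ("example.nl", "AAAA", "192.0.2.2"),
            ("sidn.nl", "A", "192.0.2.1"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(top_queried_domains(demo_connection(), 10))
```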
#Use cases
Our use cases focus on increasing the security and stability of .nl:
- DNS security app (visualizes traffic patterns for phishing domain names)
- Botnet detector (detects botnet infections and reports these to the Abuse Information Exchange, https://www.abuseinformationexchange.nl/english)
- Real-time phishing domain name detection
- Statistics dashboard (stats.sidnlabs.nl)
- Scientific research (collaboration with Dutch universities)
- Ad-hoc operational analysis (quick analysis of current issues in the DNS)
#Experiences
Our experiences in working with this data:
Once this data is available there is so much work to be done that we hired an additional data scientist.
Future work:
- Combine passive data from the .nl authoritative name servers with active scans of the complete .nl zone and ISP data.
- Add more name servers and resolvers.
- Open data interface
#Summary
1. We believe that our choice of technology combined with our privacy framework is quite novel.
2. Our setup proves that a big data platform can start small with limited costs and still be powerful.
3. We provide the rationale behind our architectural decisions with regard to tools, workflow and data formats for storage.
4. We provide example use cases of what is possible when this data is available for analysis.