Speaker
Description
The DITL dataset serves as an invaluable resource for DNS research. The author gratefully acknowledges the data providers and DNS-OARC for permitting access to the Root DITL dataset. Because data collection methodologies vary significantly—with each Root Server Operator (RSO) capturing traffic to the best of their respective capabilities—it is essential to characterize the attributes of each dataset before analysis.
Despite this need, no standardized documentation currently states whether a given dataset is anonymized, how far IP addresses are masked (e.g., whether prefixes are preserved), or whether the data covers all or only part of the traffic. This presentation reports estimated attributes for the DITL 2024 and 2025 datasets:
- Full Source IP Preservation: c (2024), g, k, and m-root datasets.
- Partial Anonymization (Prefixes Preserved): a, b, d, f, h, and j-root datasets appear to mask source IPs but preserve /24 (IPv4) and /64 (IPv6) prefixes.
- Full Anonymization (No Prefix Preservation): i and l (2024) root datasets.
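One way to probe for prefix preservation can be sketched as follows. This is only an illustrative heuristic, not necessarily the method used in the talk: it assumes a reference list of source addresses from a dataset believed to keep full IPs and a candidate list from a dataset under test, and it measures how many of the candidate's /24 (IPv4) or /64 (IPv6) prefixes also occur in the reference. The function names `prefix_key` and `prefix_overlap` are hypothetical.

```python
import ipaddress


def prefix_key(addr: str) -> str:
    """Map an address to its /24 (IPv4) or /64 (IPv6) network, as a string."""
    ip = ipaddress.ip_address(addr)
    length = 24 if ip.version == 4 else 64
    # strict=False lets us pass a host address and get its enclosing network.
    return str(ipaddress.ip_network(f"{addr}/{length}", strict=False))


def prefix_overlap(reference: list[str], candidate: list[str]) -> float:
    """Fraction of candidate prefixes that also appear in the reference set."""
    ref = {prefix_key(a) for a in reference}
    cand = {prefix_key(a) for a in candidate}
    if not cand:
        return 0.0
    return len(ref & cand) / len(cand)


# Toy input using documentation address ranges (assumed values, not DITL data).
reference = ["192.0.2.1", "198.51.100.7", "2001:db8:0:1::abcd"]
candidate = ["192.0.2.200", "203.0.113.9"]
print(prefix_overlap(reference, candidate))
```

Under this assumption, a dataset that preserves prefixes should share most of its prefixes with the reference for the same time window, while a fully anonymized one should share almost none. What overlap threshold separates the two cases would have to be chosen empirically.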
Furthermore, by cross-referencing these datasets with RSSAC002 metrics for April 10, 2024, and April 9, 2025, I assessed data completeness. My findings suggest that the e-root dataset contains approximately 1% of total queries, the f-root dataset contains roughly one-third of the expected traffic, and the i-root dataset exhibits data gaps. Finally, as UDP checksums appear to be preserved in certain datasets, I attempted to reverse-engineer the original source IP addresses, with limited success in specific instances.
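To illustrate why a preserved UDP checksum can leak the source address, here is a minimal sketch for IPv4. The UDP checksum (RFC 768) covers a pseudo-header that includes the source IP, so if only the host part of the address was altered, candidate addresses can be tested against the checksum. The sketch assumes that the /24 prefix, destination address, ports, payload, and checksum were all left untouched by the anonymization. This may not hold for any given dataset, and it is not presented as the procedure used in the talk. The function names and the sample packet are hypothetical.

```python
import ipaddress
import struct


def ones_complement_sum(data: bytes) -> int:
    """Sum 16-bit big-endian words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input, as the checksum rules require
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def udp_checksum(src: bytes, dst: bytes, segment: bytes) -> int:
    """IPv4 UDP checksum; `segment` must have its checksum field zeroed."""
    pseudo = src + dst + struct.pack("!BBH", 0, 17, len(segment))
    csum = ~ones_complement_sum(pseudo + segment) & 0xFFFF
    return csum or 0xFFFF  # a computed zero is transmitted as 0xFFFF


def candidate_sources(prefix24: str, dst: str, segment: bytes, observed: int) -> list[str]:
    """Addresses in the /24 whose pseudo-header reproduces the observed checksum."""
    dst_b = ipaddress.ip_address(dst).packed
    zeroed = segment[:6] + b"\x00\x00" + segment[8:]
    return [
        str(host)
        for host in ipaddress.ip_network(prefix24)
        if udp_checksum(host.packed, dst_b, zeroed) == observed
    ]


# Synthetic packet with a known source (documentation addresses, not DITL data).
true_src, dst = "192.0.2.57", "198.51.100.4"
payload = b"\x12\x34example-dns-payload"
header = struct.pack("!HHHH", 53000, 53, 8 + len(payload), 0)
checksum = udp_checksum(
    ipaddress.ip_address(true_src).packed,
    ipaddress.ip_address(dst).packed,
    header + payload,
)
segment = header[:6] + struct.pack("!H", checksum) + payload
print(candidate_sources("192.0.2.0/24", dst, segment, checksum))
```

With a preserved /24, only eight bits are unknown and the 16-bit checksum is enough to single out one host address. With no prefix preserved, 32 unknown bits face a 16-bit constraint, so many addresses match and the checksum alone cannot identify the source.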
| Talk duration | 20 Minutes (+5 for Q&A) |
|---|---|