May 10 – 11, 2014
Sofitel Warsaw Victoria
Europe/Warsaw timezone

Large scale regular expression recognition on the DITL data-set by using similarity search

May 10, 2014, 11:45 AM
Opera (Sofitel Warsaw Victoria)


Sofitel Warsaw Victoria

11 Królewska Street 00-065 Warsaw
Members-Only Session


Dr Arnoldo Muller-Molina (simMachines)


The Day in the Life (DITL) data-set is collected to study and improve the integrity of the root server system. Among the properties recorded in the data-set, we focus on second-level domain (SLD) strings. In this study, we introduce a method that automatically infers regular expressions from over-represented SLD strings. First, we identify random strings and remove them from the data pipeline. Then, we find common string seeds that guide the elucidation process. Finally, we perform similarity search on strings that do not exceed a certain entropy level to generate a weight matrix, which is then converted into regular expressions and their corresponding visualizations. Similarity search is computationally expensive, but we achieve fast results by using the simMachines R-01 similarity engine. The method may be used to preemptively discover security or performance issues in the infrastructure. During the talk, we will show a sample of the collected regular expressions so that the community may identify familiar and unfamiliar SLD patterns.
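The entropy filter and the conversion from weight matrix to regular expression described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the simMachines R-01 engine: the similarity-search step is skipped by assuming a group of similar, equal-length strings is already available. The function names, the entropy threshold, the `min_share` cutoff, and the sample SLD strings are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def weight_matrix(strings: list[str]) -> list[Counter]:
    """Per-position character counts for a group of equal-length strings."""
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all strings must have the same length")
    return [Counter(s[i] for s in strings) for i in range(length)]


def matrix_to_regex(matrix: list[Counter], min_share: float = 0.1) -> str:
    """Turn each column into a literal, a character class, or a wildcard."""
    parts = []
    for column in matrix:
        total = sum(column.values())
        # Keep only characters frequent enough at this position.
        kept = sorted(ch for ch, c in column.items() if c / total >= min_share)
        if not kept:
            parts.append(".")  # no dominant character: fall back to wildcard
        elif len(kept) == 1:
            parts.append(re.escape(kept[0]))
        else:
            parts.append("[" + "".join(re.escape(ch) for ch in kept) + "]")
    return "".join(parts)


# Hypothetical SLD strings; the threshold of 3.0 bits is an assumption.
candidates = ["mail01", "mail02", "mail17", "mail23", "x7qk2p9zw4"]
ENTROPY_THRESHOLD = 3.0
group = [s for s in candidates if shannon_entropy(s) <= ENTROPY_THRESHOLD]

pattern = matrix_to_regex(weight_matrix(group))
print(pattern)
```

In this sketch the high-entropy string is dropped before the matrix is built, and the remaining strings collapse into a single pattern that matches every member of the group. A real pipeline would also need to align strings of different lengths, which is left out here.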

Primary author

Dr Arnoldo Muller-Molina (simMachines)
