Large scale regular expression recognition on the DITL data-set by using similarity search
The day in the life (DITL) data-set is collected to study and improve the integrity of the root server system. Among the different properties recorded in the data-set, we focus on second level domain (SLD) strings. In this study, we introduce a method that automatically infers regular expressions from over-represented SLD strings. At first, we identify random strings and remove them from the data pipeline. Then, we find common string seeds that guide the elucidation process. Finally, we perform similarity search on strings that do not exceed a certain level of entropy level to generate a weight matrix that is then converted into regular expressions and their corresponding visualizations. Similarity search is a very expensive operation, but we manage to achieve fast results by using the simMachines R-01 similarity engine. The method may be used to preemptively discover security or performance issues in the infrastructure. During the talk, we will show a sample of collected regular expressions so that the community may identify familiar and unfamiliar SLD patterns.