Speaker
Description
Cisco's resolver fleet infrastructure commonly experiences large-scale distributed denial of service (DDoS) attacks. Under normal circumstances these attacks are handled by distributing the traffic over the installed resolver capacity, and they rarely cause operational issues. However, on two occasions these DDoS attacks did cause notable internal incidents, thankfully with very limited customer impact. In this talk we will present how these attacks were detected and what was done to remediate their effects.
On both occasions, the attacks were detected through alarms from the resolver fleet reporting delayed traffic servicing and delayed configuration updates. However, the reasons the resolvers struggled under these two attacks were different.
In the first case, DNAT traffic increased suddenly during the incident, which implied that resolvers in one data center (DC) were sending more referrals to a different DC to query the authority servers. Blacklisting the IP addresses used for the DDoS attack proved to be of limited value because of the large pool of addresses involved. The problem was eventually traced to contention on the cache lock used for encryption of DNSCrypt transmissions in DNAT.
In the second incident, the DDoS attacks were very short-lived and therefore difficult to analyze. Only when the team was able to capture the state of the processor threads during one of the attack events was it possible to see that many threads were spinning on a lock that controls access to the list of in-transit upstream queries.
A hash of the query's domain name, folded into 12 bits, determines which of the 4096 locked lists is used. Multiple locks mean less lock contention, but only if the hash distributes queries evenly. As it turned out, the implementation hashed only the first qname label and the target IP address, on the reasoning that these were the most volatile parts of the transmission data.
As a result, a random-label attack against <const>.<random>.<domain> would always hash to the same value and use the same lock.
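The effect can be illustrated with a minimal sketch. It is not the resolver's actual code: the hash function (SHA-256 truncated to 32 bits), the XOR folding scheme, the function names, the example domain and the single target IP address are all assumptions made for illustration. Only the 12-bit bucket index and the choice of hash inputs come from the description above.

```python
import hashlib
import random
import string
from collections import Counter

BUCKET_BITS = 12
NUM_BUCKETS = 1 << BUCKET_BITS  # 4096 locked lists


def fold(h: int, bits: int = BUCKET_BITS) -> int:
    """XOR-fold an integer hash down to `bits` bits (assumed folding scheme)."""
    mask = (1 << bits) - 1
    out = 0
    while h:
        out ^= h & mask
        h >>= bits
    return out


def hash32(data: str) -> int:
    """Stand-in 32-bit hash; the real resolver's hash function is not known."""
    return int.from_bytes(hashlib.sha256(data.encode()).digest()[:4], "big")


def bucket_first_label(qname: str, target_ip: str) -> int:
    """Bucket index from the first qname label and target IP only."""
    first_label = qname.split(".", 1)[0]
    return fold(hash32(f"{first_label}|{target_ip}"))


def bucket_full_qname(qname: str, target_ip: str) -> int:
    """Bucket index from the whole qname and target IP (one possible fix)."""
    return fold(hash32(f"{qname.lower()}|{target_ip}"))


def attack_queries(n: int, seed: int = 0) -> list[str]:
    """Hypothetical attack names of the form <const>.<random>.<domain>."""
    rng = random.Random(seed)
    return [
        "www." + "".join(rng.choices(string.ascii_lowercase, k=12)) + ".example.com"
        for _ in range(n)
    ]


TARGET_IP = "192.0.2.53"  # documentation address, assumed single authority
queries = attack_queries(10_000)

narrow = Counter(bucket_first_label(q, TARGET_IP) for q in queries)
wide = Counter(bucket_full_qname(q, TARGET_IP) for q in queries)

# With first-label hashing every attack query shares one bucket, hence one lock.
print("buckets used (first label + IP):", len(narrow))
print("buckets used (full qname + IP): ", len(wide))
```

Under these assumptions the first count is 1, because the constant first label and the fixed target IP are the only hash inputs, while hashing the full name spreads the same traffic over a large share of the 4096 buckets.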
Both incidents were eventually resolved through resolver software upgrades that reduced lock contention, though for two rather different resolver resources. Interestingly, these lock contention issues escaped detection despite extensive application and performance testing of each software release. This emphasizes the need to include specific DDoS-type tests in the software release pipeline.
Talk duration: 20 Minutes (+5 for Q&A)