rip.psg.com DNS Service Failure Post Mortem - 2024.06.14 (GMT-7)

TL;DR:
- rip.psg.com is primary for many zones
- rip DNSSEC signs three dozen, some ccTLDs
- rip was, and still is, target of a DNS DDoS attack
- Signing broke
- 2.5 days to restore, two more days to clean up oddities

History

Earlier this year, rip.psg.com was converted from bind with OpenDNSSEC running on FreeBSD to knot on Debian 12. Extracting keys from OpenDNSSEC was an interesting exercise, props to sra. We suspect that more keys than desirable were produced.

The Event(s)

Sunday 2024.06.09 a TCP+UDP DNS DDoS started against rip.psg.com.

Unfortunately, rip was not doing rate limiting or cookies. This allowed the attack to be more effective than it should have been. These have been enabled, of course.

We assumed that other nameservers for the affected zones could carry the load, so there didn't seem to be a lot more to do.

Then we noticed that the DNS for psg.com was broken. sra investigated and found that all the RRSIGs in the psg.com zone had expired three hours previously.

So we asked knot to re-sign. It complained about key validity checks and refused to load the zones. We looked for errors but found the KSKs we expected, so eventually Randy had the clever idea of throwing the "no-check-keyset" switch, which allowed the zones to load. In retrospect, this was a mistake, but it appeared to solve the immediate problem. It only *appeared* to; the problem was not solved.

The evening of 2024.06.11 we finally proved to ourselves that DNSSEC was broken, via a test involving a machine in another POP running a combination of `delv`, `unbound`, and `tcpdump`, which is where we saw answers coming back with DNSKEYs but no RRSIGs.

Weirdly, one zone, ymbk.com, remained fine, served by the same knot instance with the same configuration and tested with the same delv/unbound/tcpdump setup.

As Randy's mail and servers are in psg.com, with the broken DNS signatures, he was unable to get much mail.
And outbound mail to any destination which checked reverse DNS would not deliver, as the reverse zones for the address space were also broken. So a communication issue exacerbated the problem.

By the evening of Tuesday 2024.06.11, we were pretty worried, and Randy had already queried for knot clue, but due to the knock-on effect on psg.com's email system we did not have an answer yet.

We couldn't find any significant difference between psg.com's configuration and ymbk.com's configuration, and the DNSSEC problem was affecting enough zones that we were pretty sure it was not zone-data-specific, which eventually led us to suspect the key database.

By the time we got to this point we were short enough on sleep that we decided that pushing more buttons would likely just break things further, so we left it until morning, with results you have seen.

Korry, over in WIDE, relayed a key suggestion from Libor over in the Knot crew to selectively activate the most recent ZSK. And that was finally enough to whack the key database for the critical zones into working again (thanks!).

We then cleaned the extra keys, also per Libor's suggestion. We tested this with one small and relatively non-critical zone (cryptech.is), and it seemed to have worked. So we cleaned the key sets for the rest of the signed zones.

One Last Anomaly

While at this point everything resolved, validated, ..., there remained a problem: zonemaster and dnsviz were annoyed that the signed zones had multiple RRSIGs when they should have had only one.

Many theories were put forth, but Anand nailed it.

> Perhaps during the DDoS, the BIND secondary received a corrupt IXFR
> that added a new RRSIG, but didn't delete the old one?

Configuring knot to not allow IXFR and re-signing the zones proved Anand to be correct.

Conjecture

We're pretty sure that both the DDoS attack and the DNSSEC problem were real, and that neither was just a symptom of the other.
Both had sufficiently nasty effects that we would have noticed either one of them independently within fairly short order, so it's not just that we didn't see one of them until we went looking because of the other.

So the question is the causal linkage, because having both failures show up the same day seems a bit much to be a coincidence.

Half-assed theories, none very attractive:

* CPU load due to the DDoS caused knot's signer to fall behind?

* Too many packets dropped during keyset validation checks caused knot to unblock a key roll we had somehow left hanging for months by failing to completely clean up the key database?

* A new and exciting bug in knot that we can't even describe properly, much less reproduce? :)

Back to History

A bit of perhaps-relevant history: the migration from OpenDNSSEC to knot was a bit rough. This was (at least) our third attempt to extract these zones from OpenDNSSEC, and was the one that finally worked (all hail knot's relatively clean design and key storage, at least by comparison to all the alternatives!).

Due to initial misconfigurations, a number of zones were accidentally set up with automatic KSK rolls, which was not something we wanted; we eventually got this cleaned up in the config, and thought things were stable, but did not think to perform a detailed examination of the resulting key database. In retrospect that was probably a mistake too: we should have cleaned up anything we didn't understand and hadn't requested.

Ideally, the configuration we want is manual KSK update, automatic ZSK update, which is knot's default anyway.

Lessons

As it is possible the DDoS affected our ability to correctly sign, we should probably move signing to a separate, 'hidden' server.

We need deeper, more thorough monitoring of the DNS service, DNSSEC in particular.

Randy should construct a more resilient mail infrastructure, i.e. one not sharing fate with the local DNS.
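
On the monitoring lesson above: a minimal sketch of the kind of check that might have flagged the expired RRSIGs before resolvers did. It assumes RRSIG records in standard presentation format (for example, as printed by `dig +dnssec`). The function names, the three-day threshold, and the example.com sample records are all hypothetical illustrations, not anything that runs on rip.

```python
from datetime import datetime, timedelta, timezone

# Made-up sample records for illustration only (not real zone data).
SAMPLE = [
    "example.com. 3600 IN RRSIG SOA 13 2 3600 20240609120000 20240526103000 12345 example.com. c2lnbmF0dXJl",
    "example.com. 3600 IN RRSIG NS 13 2 3600 20240701000000 20240617000000 12345 example.com. c2lnbmF0dXJl",
    "example.com. 3600 IN NS ns1.example.com.",
]


def parse_rrsig_expiry(rr_line):
    """Return (owner, type_covered, expiration) for one RRSIG line."""
    fields = rr_line.split()
    idx = fields.index("RRSIG")
    # RRSIG RDATA order: type covered, algorithm, labels, original TTL,
    # signature expiration, signature inception, key tag, signer, signature.
    # This sketch handles only the YYYYMMDDHHMMSS timestamp form.
    expiration = datetime.strptime(fields[idx + 5], "%Y%m%d%H%M%S")
    return fields[0], fields[idx + 1], expiration.replace(tzinfo=timezone.utc)


def expiring_soon(rr_lines, now, min_remaining=timedelta(days=3)):
    """List RRSIGs that are expired or have less than min_remaining left."""
    alerts = []
    for line in rr_lines:
        if "RRSIG" not in line.split():
            continue  # not a signature record
        owner, covered, expiration = parse_rrsig_expiry(line)
        if expiration - now < min_remaining:
            alerts.append((owner, covered, expiration))
    return alerts


if __name__ == "__main__":
    for owner, covered, expiration in expiring_soon(SAMPLE, datetime.now(timezone.utc)):
        print(f"ALERT {owner} RRSIG({covered}) expires {expiration.isoformat()}")
```

Fed from a machine outside the POP on a schedule, something along these lines would complain while signatures are merely getting stale, rather than three hours after they expire. It checks only expiry; it would not by itself catch the duplicate-RRSIG anomaly.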

Thanks To:

Anand, Håvard, knot folk, Korry, Liman, Mark, maz, Saad, sra, ...

-30-