rip.psg.com DNS Service Failure Post Mortem - 2024.06.14 (GMT-7)

TL;DR:
- rip.psg.com is primary for many zones
- rip DNSSEC-signs three dozen of them, including some ccTLDs
- rip was, and still is, the target of a DNS DDoS attack
- Signing broke
- 2.5 days to restore, two more days to clean up oddities

History

Earlier this year, rip.psg.com was converted from bind with OpenDNSSEC
running on FreeBSD to knot on Debian 12.  Extracting keys from
OpenDNSSEC was an interesting exercise, props to sra.  We suspect that
more keys than desirable were produced.

The Event(s)

On Sunday 2024.06.09 a TCP+UDP DNS DDoS started against rip.psg.com.
Unfortunately, rip was not doing rate limiting or cookies, which made
the attack more effective than it should have been.  Both have since
been enabled, of course.

We assumed that other nameservers for the affected zones could carry the
load, so there didn't seem to be a lot more to do.  Then we noticed that
the DNS for psg.com was broken.  sra investigated and found that all the
RRSIGs in the psg.com zone had expired three hours previously.

So we asked knot to re-sign.  It complained about key validity checks
and refused to load the zones.  We looked for errors but found the KSKs
we expected, so eventually Randy had the clever idea of throwing the
"no-check-keyset" switch, which allowed the zones to load.  In
retrospect this was a mistake: it *appeared* to solve the immediate
problem, but the problem was not solved.

The evening of 2024.06.11 we finally proved to ourselves that DNSSEC was
broken, via a test involving a machine in another POP running a
combination of `delv`, `unbound`, and `tcpdump`, which is where we saw
answers coming back with DNSKEYs but no RRSIGs.  Weirdly, one zone,
ymbk.com, remained fine, served by the same knot instance with the same
configuration and tested with the same delv/unbound/tcpdump setup.
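
The symptom above (DNSKEYs coming back with no RRSIGs) can also be
spotted mechanically.  A minimal Python sketch, offered as an
illustration rather than the tooling used during the event: it assumes
the answer text is in the usual `owner ttl class type rdata` form that
`dig +dnssec` prints, and the function name and the example.com records
are made up.

```python
def unsigned_rrsets(answer_text):
    """Return (owner, type) pairs that have no covering RRSIG in the text.

    Assumes one record per line in 'owner ttl class type rdata' form.
    """
    present = set()   # RRsets seen in the answer
    covered = set()   # RRsets for which an RRSIG was seen
    for line in answer_text.splitlines():
        fields = line.split()
        # Skip comments, blank lines, and anything too short to be a record.
        if len(fields) < 5 or line.lstrip().startswith(";"):
            continue
        owner, rrtype = fields[0].lower(), fields[3].upper()
        if rrtype == "RRSIG":
            # The first RDATA field of an RRSIG is the type it covers.
            covered.add((owner, fields[4].upper()))
        else:
            present.add((owner, rrtype))
    return present - covered


# Hypothetical records, not taken from the affected zones.
broken = "example.com. 3600 IN DNSKEY 257 3 13 AAAA\n"
print(unsigned_rrsets(broken))
```

An empty result means every RRset in the answer carried at least one
signature; it says nothing about whether that signature validates.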

As Randy's mail and servers are in psg.com, the broken DNS signatures
meant he was unable to get much mail.  Outbound mail to any destination
which checked reverse DNS could not be delivered either, as the reverse
zones for the address space were also broken.  So a communication issue
exacerbated the problem.

By the eve of Tuesday 2024.06.11, we were pretty worried, and Randy had
already queried for knot clue, but due to the knock-on effect on
psg.com's email system we did not have an answer yet.  We couldn't find
any significant difference between psg.com's configuration and
ymbk.com's configuration, and the DNSSEC problem was affecting enough
zones that we were pretty sure it was not zone-data-specific, which
eventually led us to suspect the key database.  By the time we got to
this point we were short enough on sleep that we decided that pushing
more buttons would likely just break things further, so we left it
until morning, with results you have seen.

Korry, over in WIDE, relayed a key suggestion from Libor over in the
Knot crew to selectively activate the most recent ZSK.  And that was
finally enough to whack the key database for the critical zones into
working again (thanks!).

We then cleaned the extra keys, also per Libor's suggestion.  We tested
this with one small and relatively non-critical zone (cryptech.is),
and it seemed to have worked.  So we cleaned the key sets for the rest
of the signed zones.

One Last Anomaly

At this point everything resolved, validated, ..., but one problem
remained: zonemaster and dnsviz were annoyed that the signed zones had
multiple RRSIGs when they should have had only one.  Many theories were
put forth, but Anand nailed it.

> Perhaps during the DDoS, the BIND secondary received a corrupt IXFR
> that added a new RRSIG, but didn't delete the old one?

Configuring knot to not allow IXFR and re-signing the zones proved
Anand to be correct.
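
A stale RRSIG left behind next to a fresh one is easy to count for in a
zone dump from a secondary.  A minimal Python sketch: it assumes
records in the text form `dig` prints, and it assumes the leftover
signature was made by the same key as the new one, which this report
does not actually establish.  Several RRSIGs from different keys can be
perfectly legitimate, so the sketch groups by algorithm and key tag.
The function name and the example.com records are made up.

```python
from collections import Counter


def duplicate_rrsigs(zone_text):
    """Return {(owner, covered type, algorithm, key tag): count} where count > 1."""
    counts = Counter()
    for line in zone_text.splitlines():
        fields = line.split()
        # owner ttl class RRSIG covered alg labels origttl expire incept keytag signer sig
        if len(fields) < 12 or fields[3].upper() != "RRSIG":
            continue
        key = (fields[0].lower(), fields[4].upper(), fields[5], fields[10])
        counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical old and new signatures over the same SOA, same key.
old = "example.com. 3600 IN RRSIG SOA 13 2 3600 20240609120000 20240526120000 12345 example.com. AAAA"
new = "example.com. 3600 IN RRSIG SOA 13 2 3600 20240625120000 20240611120000 12345 example.com. BBBB"
print(duplicate_rrsigs(old + "\n" + new))
```

Running it against each secondary separately would show whether only
some of them kept the old signature.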

Conjecture

We're pretty sure that both the DDoS attack and the DNSSEC problem were
real, and that neither was just a symptom of the other.  Both had
sufficiently nasty effects that we would have noticed either one of them
independently in fairly short order, so it's not just that we didn't
see one of them until we went looking because of the other.  So the
question is the causal linkage, because having both failures show up the
same day seems a bit much to be a coincidence.  Half-assed theories,
none very attractive:

* CPU load due to the DDoS caused knot's signer to fall behind?

* Too many packets dropped during keyset validation checks caused knot
  to unblock a key roll we had somehow left hanging for months by
  failing to completely clean up the key database?

* A new and exciting bug in knot that we can't even describe properly,
  much less reproduce? :)

Back to History

A bit of perhaps-relevant history: the migration from OpenDNSSEC to knot
was a bit rough.  This was (at least) our third attempt to extract these
zones from OpenDNSSEC, and was the one that finally worked (all hail
knot's relatively clean design and key storage, at least by comparison
to all the alternatives!).  Due to initial misconfigurations, a number
of zones were accidentally set up with automatic KSK rolls, which was
not something we wanted; we eventually got this cleaned up in the config,
and thought things were stable, but did not think to perform a detailed
examination of the resulting key database.  In retrospect that was
probably a mistake too: we should have cleaned up anything we didn't
understand and hadn't requested.  Ideally, the configuration we want is
manual KSK update, automatic ZSK update, which is knot's default anyway.

Lessons

As it is possible the DDoS affected our ability to sign correctly, we
should probably move signing to a separate, 'hidden' server.

We need deeper, more thorough monitoring of the DNS service, and of
DNSSEC in particular.
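
One cheap check of that kind is to watch RRSIG expiration times, since
expired signatures were the first hard symptom here.  A minimal Python
sketch, not a finished monitor: it assumes RRSIG lines in the text form
`dig` prints, where the expiration field is a UTC timestamp written as
YYYYMMDDHHmmSS (RFC 4034).  The function names, the three-day warning
window, and the example.com record are all assumptions made up for
illustration.

```python
from datetime import datetime, timedelta, timezone


def rrsig_expiration(line):
    """Return the expiration time of one RRSIG line in dig's text form."""
    fields = line.split()
    idx = fields.index("RRSIG")
    # After "RRSIG": covered type, algorithm, labels, original TTL, expiration.
    stamp = fields[idx + 5]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def signature_status(line, now, warn_before=timedelta(days=3)):
    """Classify a signature as 'expired', 'expiring', or 'ok' relative to now."""
    expires = rrsig_expiration(line)
    if expires <= now:
        return "expired"
    if expires - now <= warn_before:
        return "expiring"
    return "ok"


# Hypothetical record; the dates are invented, not those of the incident.
line = ("example.com. 3600 IN RRSIG SOA 13 2 3600 "
        "20240609120000 20240526120000 12345 example.com. AAAA")
print(signature_status(line, datetime.now(timezone.utc)))
```

The warning window would need to be shorter than the signer's refresh
interval, otherwise healthy zones would alarm on every cycle.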

Randy should construct a more resilient mail infrastructure, i.e. one
not sharing fate with the local DNS.

Thanks To:

  Anand, Håvard, knot folk, Korry, Liman, Mark, maz, Saad, sra, ...

-30-