It reads a lot like the Crowdstrike SNAFU. Machine-generated configuration file b0rks-up the software that consumes it.
The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.
The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.
At my employer, we have a small script that automatically checks such generated config files. It does a diff between the old and the new version, and if the diff size exceeds a threshold (either total or relative to the size of the old file), it refuses to do the update, and opens a ticket for a human to look over it.
It has somewhat regularly saved us from disaster in the past.
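For the curious, the guard is basically this shape (a minimal sketch with made-up names and thresholds, not our actual script):

    # Sketch of a diff-size guard for generated config files
    # (hypothetical paths and thresholds; the real script differs in detail).
    import difflib
    import sys

    MAX_CHANGED_LINES = 200      # absolute cap on changed lines
    MAX_RELATIVE_CHANGE = 0.30   # or 30% of the old file, whichever trips first

    def diff_too_large(old_path, new_path):
        with open(old_path) as f:
            old = f.readlines()
        with open(new_path) as f:
            new = f.readlines()
        changed = sum(
            1 for line in difflib.unified_diff(old, new, n=0)
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
        )
        return changed > MAX_CHANGED_LINES or (
            len(old) > 0 and changed / len(old) > MAX_RELATIVE_CHANGE
        )

    if __name__ == "__main__":
        old_file, new_file = sys.argv[1], sys.argv[2]
        if diff_too_large(old_file, new_file):
            # The real script opens a ticket for a human here instead of just failing.
            print("refusing update: diff exceeds threshold, needs human review")
            sys.exit(1)
        print("diff within threshold, applying update")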
I don't think this system is best thought of as "deployment" in the sense of CI/CD; it's a control channel for a distributed bot detection system that (apparently) happens to be actuated by published config files (it has a consul-template vibe to it, though I don't know if that's what it is).
That's why I likened it to Crowdstrike. It's a signature database that blew up the consumer of said database. (You probably caught my post mid-edit, too. You may be replying to the snarky paragraph I thought better of and removed.)
Edit: Similar to Crowdstrike, the bot detector should have fallen back to its last-known-good signature database after panicking, instead of just continuing to panic.
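Something shaped roughly like this, to be concrete (purely a sketch with invented paths and a placeholder validator; I obviously don't know what either codebase actually looks like):

    # Sketch of a last-known-good fallback for a rules/signature consumer.
    import json
    import logging
    import shutil

    RULES_PATH = "/etc/botd/rules.json"               # freshly propagated rules (made-up path)
    LAST_GOOD_PATH = "/etc/botd/rules.last-good.json"

    def validate(rules):
        # Placeholder for whatever invariants the engine relies on; the point
        # is to fail here rather than panic in the request-handling hot path.
        if not isinstance(rules.get("features"), list):
            raise ValueError("malformed rules file")

    def load_rules():
        try:
            with open(RULES_PATH) as f:
                rules = json.load(f)
            validate(rules)
            shutil.copyfile(RULES_PATH, LAST_GOOD_PATH)  # promote to last-known-good
            return rules
        except Exception:
            logging.exception("new rules rejected; falling back to last-known-good")
            with open(LAST_GOOD_PATH) as f:
                return json.load(f)  # keep serving with stale-but-working rules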
Code and Config should be treated similarly. If you would use a ring based rollout, canaries, etc for safely changing your code, then any config that can have the same impact must also use safe rollout techniques.
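A minimal sketch of what I mean by a ring-based rollout, with invented ring names and a generic health signal supplied by the caller:

    # Illustrative ring-based rollout for a config artifact (all names invented).
    import time

    RINGS = [
        ("canary", 0.01),   # ~1% of fleet
        ("early",  0.10),
        ("broad",  0.50),
        ("global", 1.00),
    ]

    def roll_out(artifact, deploy, healthy, bake_seconds=300):
        """Push `artifact` ring by ring; halt (leaving rollback to the caller)
        if the health signal degrades after any ring."""
        for name, fraction in RINGS:
            deploy(artifact, fraction)   # push to this slice of the fleet
            time.sleep(bake_seconds)     # let error-rate metrics settle
            if not healthy(name):
                raise RuntimeError(f"rollout halted at ring {name!r}")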
You're the nth person on this thread to say that and it doesn't make sense. Events that happen multiple times per second change data that you would call "configuration" in systems like these. This isn't `sendmail.cf`.
If you want to say that systems that light up hundreds of customers, or propagate new reactive bot rules, or notify a routing system that a service has gone down are intrinsically too complicated, that's one thing. By all means: "don't build modern systems! computers are garbage!". I have that sticker on my laptop already.
But like: handling these problems is basically the premise of large-scale cloud services. You can't just define it away.
I'm sorry to belabor this but I'm genuinely not understanding what you're saying in this reply. I haven't operated large scale systems. I'm just an IT generalist and casual coder. I acknowledge I'm too inexperienced to even know what I don't know re: running large systems.
I read the parent poster as broadly suggesting configuration updates should have fitness tests applied and be deployed to minimize the blast radius when an update causes a malfunction. That makes intuitive sense to me. It seems like software should be subject to health checks after configuration updates, even if it's just to stop a deployment before it's widely distributed (let alone rolling back to last-working configurations, etc.).
Am I being thick-headed in thinking defensive strategies like those are a good idea? I'm reading your reply as arguing against those types of strategies. I'm also not understanding what you're suggesting as an alternative.
Again, I'm sorry to belabor this. I've replied once, deleted it, tried writing this a couple more times and given up, and now I'm finally pulling the trigger. It's really eating at me. I feel as though I must be deep down the Dunning-Kruger rabbit hole and really thinking "outside my lane".
The things you do to safeguard the rollout of a configuration file change are not the same as the things you do to reliably propagate changes that might happen many times per second.
What's irritating to me are the claims that there's nothing distinguishing real-time control plane state changes from config files. Most of us have an intuition for how we'd do a careful rollout of a config file change. That intuition doesn't hold for control plane state; it's like saying, for instance, that OSPF should have canaries and staged rollouts every time a link state changes.
I'm not saying there aren't things you can do to make real-time control plane state propagation safer, or that Cloudflare did all those things (I have no idea, I'm not familiar with their system at all, which is another thing irritating me about this thread --- the confident diagnostics and recommendations). I'm saying that people trying to do the "this is just like CrowdStrike" thing are telling on themselves.
I took the "this sounds like Crowdstrike" tack for two reasons. The write-up characterized this update as an every five minutes process. The update, being a file of rules, felt analogous in format to the Crowdstrike signature database.
I appreciate the OSPF analogy. I recognize there are portions of these large systems that operate more like a routing protocol (with updates being unpredictable in velocity or time of occurrence). The write-up didn't make this seem like one of those. This seemed a lot more like a traditional daemon process receiving regular configuration updates and crashing on a bad configuration file.
It is possible that any number of things people on this thread have called out are, in fact, the right move for the system Cloudflare built (it's hard to know without knowing more about the system, and my intuition for their system is also faulty because I irrationally hate periodic batch systems like these).
Most of what I'm saying is:
(1) Looking at individual point failures and saying "if you'd just fixed that you wouldn't have had an incident" is counterproductive; like Mr. Oogie-Boogie, every big distributed system is made of bugs. In fact, that's true of literally every complex system, which is part of the subtext behind Cook[1].
(2) I think people are much too quick to key in on the word "config" and just assume that it's morally indifferentiable from source code, which is rarely true in large systems like this (might it have been here? I don't know.) So my eyes twitch like Louise Belcher's when people say "config? you should have had a staged rollout process!" Depends on what you're calling "config"!
I just want to point out a few things you may have overlooked. First, the bot config gets updated every 5 minutes, not every few seconds. Second, they already have config checks in other places ("Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input").
They could probably even fold this into CI/CD by running the config verifier where the configs are generated. This is of course all hindsight guessing, but you make it sound as though it's arcane and impossible to do anything about.
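Even something this simple at the generation side would do it (hypothetical script and verifier name; I have no idea what their pipeline actually looks like):

    # Hypothetical: run the same verifier the consumer uses at generation time,
    # before the artifact is ever published to the fleet. "bot-rules-verify" is
    # an invented command name standing in for whatever the consumer uses.
    import subprocess
    import sys

    def generate_and_publish(generate, publish, artifact_path="bot_features.json"):
        generate(artifact_path)
        # Reusing the consumer's own validation logic keeps "valid at generation
        # time" and "valid at load time" from drifting apart.
        result = subprocess.run(["bot-rules-verify", artifact_path])
        if result.returncode != 0:
            sys.exit("generated config failed verification; not publishing")
        publish(artifact_path)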
The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.
The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.