But the rapid deployment mechanism for bot features wasn’t where the bug was introduced.
In fact, the root bug (faulty assumption?) was in one or more SQL catalog queries that were presumably written some time ago.
(Interestingly, the analysis doesn’t go into how these erroneous queries made it into production, OR whether the assumption was “to spec” and it’s the security-principal change work that was faulty. The former seems more likely.)
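To make the failure mode concrete, here is a minimal sketch of the kind of faulty assumption a catalog query can carry — not Cloudflare’s actual query, just an illustration assuming a ClickHouse-style `system.columns` catalog, a hypothetical `http_request_features` table, and a made-up 200-feature downstream limit. The point is that a query with no database/schema filter silently starts returning duplicate rows once the querying principal can see the same tables through a second path, and a cheap validation step would catch that before anything is published.

```python
# A minimal sketch of the faulty-assumption failure mode, NOT Cloudflare's actual
# query: a catalog lookup that implicitly assumes one row per column, which
# silently breaks once the querying principal can see the same tables through a
# second database/schema.

from collections import Counter

# Hypothetical catalog query: note there is no filter on the database/schema,
# so widening the principal's visibility duplicates every row.
FEATURE_COLUMNS_QUERY = """
SELECT name, type
FROM system.columns
WHERE table = 'http_request_features'
ORDER BY name
"""

def build_feature_list(rows: list[tuple[str, str]], max_features: int = 200) -> list[str]:
    """Validate catalog rows before they are baked into a config artifact.

    Rejects duplicates and enforces the downstream size limit instead of
    letting an oversized file propagate to every edge machine.
    """
    names = [name for name, _type in rows]
    dupes = [n for n, count in Counter(names).items() if count > 1]
    if dupes:
        raise ValueError(f"catalog returned duplicate columns: {dupes[:5]}")
    if len(names) > max_features:
        raise ValueError(f"{len(names)} features exceeds downstream limit of {max_features}")
    return names

if __name__ == "__main__":
    # Before the permissions change: one row per column.
    ok_rows = [("bot_score", "Float64"), ("ja3_hash", "String")]
    print(build_feature_list(ok_rows))

    # After the change: the same columns are visible via a second database,
    # so the unfiltered query returns each of them twice.
    dup_rows = ok_rows + ok_rows
    try:
        build_feature_list(dup_rows)
    except ValueError as err:
        print("rejected:", err)
```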
It was a change to the database that is used to generate a bot management config file; that file was the proximate cause of the panics. The kind of observability that would have helped here is “panics are elevated, and here are the binary and config changes that preceded them,” along with a rollback runbook for all of it.
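What that correlation could look like, as a rough sketch: given the time of a panic-rate spike and a unified log of binary and config pushes, return whatever changed in the preceding window, most recent first. The event shapes, service names, and timestamps below are invented, not any real Cloudflare telemetry.

```python
# A rough sketch of "panics are elevated, here is what changed right before":
# join a panic-rate spike against a unified change log of binary and config pushes.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChangeEvent:
    kind: str        # "binary" or "config"
    system: str      # which service or pipeline pushed it
    ref: str         # commit, release, or config version
    at: datetime

def suspects(panic_spike_at: datetime,
             changes: list[ChangeEvent],
             lookback: timedelta = timedelta(minutes=30)) -> list[ChangeEvent]:
    """Return changes that landed shortly before the spike, most recent first,
    so the rollback runbook has an ordered list to work from."""
    window_start = panic_spike_at - lookback
    hits = [c for c in changes if window_start <= c.at <= panic_spike_at]
    return sorted(hits, key=lambda c: c.at, reverse=True)

if __name__ == "__main__":
    spike = datetime(2025, 11, 18, 11, 28)
    log = [
        ChangeEvent("binary", "edge-proxy", "release-2417", spike - timedelta(hours=6)),
        ChangeEvent("config", "bot-mgmt-feature-file", "v1120", spike - timedelta(minutes=8)),
    ]
    for c in suspects(spike, log):
        print(f"{c.at:%H:%M} {c.kind:<6} {c.system} ({c.ref})")
```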
Generally I would say we as an industry are more nonchalant about config changes than about binary changes. Where an org might have great processes and systems in place for binary rollouts, the whole fleet could be reading config from a database in a much more lax fashion. Those setups are actually quite risky.
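A sketch of what closing that gap might mean in practice: give a config push the same staged treatment a binary gets — validate the artifact, canary it to a small slice of the fleet, check health, and only then widen. The stage fractions, threshold, and publish/observe hooks here are placeholders, not a description of any real rollout system.

```python
# A sketch of treating config pushes like binary rollouts: validate, canary,
# watch health, expand. The publish and health-observation hooks are fakes.

STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of fleet at each step

def fleet_healthy(panic_rate: float, threshold: float = 0.001) -> bool:
    return panic_rate < threshold

def rollout_config(artifact: bytes, publish, observe_panic_rate) -> bool:
    """Push `artifact` in stages; halt and report failure if panics climb."""
    if not artifact:
        raise ValueError("refusing to publish an empty config artifact")
    for fraction in STAGES:
        publish(artifact, fraction)
        if not fleet_healthy(observe_panic_rate(fraction)):
            # Halt the rollout; the runbook reverts the slice to the
            # last known-good artifact.
            return False
    return True

if __name__ == "__main__":
    # Fake hooks: a healthy rollout, then one where the new config triggers panics.
    ok = rollout_config(b"features-v2",
                        publish=lambda art, frac: print(f"publish to {frac:.0%}"),
                        observe_panic_rate=lambda frac: 0.0)
    print("healthy rollout succeeded:", ok)

    bad = rollout_config(b"features-v3-oversized",
                         publish=lambda art, frac: print(f"publish to {frac:.0%}"),
                         observe_panic_rate=lambda frac: 0.05)
    print("bad rollout halted:", not bad)
```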
I am genuinely curious (albeit skeptical!) how anyone like Cloudflare could make that kind of feedback loop work at scale.
Even within CF’s “critical path” alone there must be dozens of interconnected services and systems. How do you close the loop between an observed panic at the edge and a database configuration change N systems upstream?
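One way the loop becomes closeable at all, at least in a sketch: every stage that transforms the artifact appends its own provenance record, so a panic report at the edge can name the upstream database change directly instead of N teams reconstructing the chain by hand. The stage names and change IDs below are invented for illustration.

```python
# Hypothetical provenance stamping: each pipeline stage appends a record to the
# artifact's metadata, so the edge can report "config X, built from DB change Y".

import hashlib
import json

def stamp(artifact: dict, stage: str, change_ref: str) -> dict:
    """Return a copy of the artifact with this stage's provenance appended."""
    out = dict(artifact)
    out["provenance"] = artifact.get("provenance", []) + [
        {"stage": stage, "change": change_ref}
    ]
    out["checksum"] = hashlib.sha256(
        json.dumps(out["provenance"], sort_keys=True).encode()
    ).hexdigest()[:12]
    return out

if __name__ == "__main__":
    cfg = {"features": ["bot_score", "ja3_hash"]}
    cfg = stamp(cfg, "database-permissions", "change-8821")
    cfg = stamp(cfg, "feature-file-builder", "build-1120")
    cfg = stamp(cfg, "edge-distribution", "push-44312")

    # What an edge machine could attach to its panic report:
    print(json.dumps(cfg["provenance"], indent=2))
```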