
Thanks for the explanation! This definitely reminds me of the CrowdStrike outage last year:

- A product depends on frequent configuration updates to defend against attackers.

- A bad data file is pushed into production.

- The system is unable to easily/automatically recover from bad data files.

(The CrowdStrike outage was quite a bit worse though, since it took down the entire computer and remediation required manual intervention on thousands of desktops, whereas parts of Cloudflare were still usable throughout the outage and the issue was 100% resolved in a few hours.)
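To make the third point concrete, here is a minimal sketch of one way a consumer could survive a bad data file: validate every update before activating it, and keep serving the last known-good version if validation fails. Everything here is an assumption for illustration (the JSON format, the `entries` key, the `MAX_ENTRIES` limit, and the `ConfigHolder` class are hypothetical); it is not how Cloudflare or CrowdStrike actually load their files.

```python
import json

MAX_ENTRIES = 200  # hypothetical hard limit on how large the file may grow


class ConfigError(Exception):
    """Raised when a pushed data file fails validation."""


def parse_config(raw: str) -> dict:
    """Parse and validate a data file, raising ConfigError on any problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("entries"), list):
        raise ConfigError("expected an object with an 'entries' list")
    if len(data["entries"]) > MAX_ENTRIES:
        raise ConfigError(f"{len(data['entries'])} entries exceeds {MAX_ENTRIES}")
    return data


class ConfigHolder:
    """Holds the active config and refuses to replace it with a bad one."""

    def __init__(self, initial: dict):
        self.active = initial
        self.rejected = 0  # count of rejected updates, useful for alerting

    def apply_update(self, raw: str) -> bool:
        """Activate the update if it validates; otherwise keep the old config."""
        try:
            candidate = parse_config(raw)
        except ConfigError:
            self.rejected += 1
            return False
        self.active = candidate
        return True
```

With this shape, a malformed or oversized push increments `rejected` and returns `False` instead of crashing the process, so the service keeps running on the previous file while someone investigates.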





It might remind you of CrowdStrike because of the scale.

The large majority of outages are caused by change: either deployments of new versions or configuration changes.


Zone your deployments first, blue/green style. Have a small blue zone and test the change there. If it works, then expand to the green deployments.
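A rough sketch of that zoned approach, which is often called a staged or canary rollout: deploy to one zone at a time and stop as soon as a zone looks unhealthy. The `staged_rollout` function, the `deploy` and `healthy` callbacks, and the host names are all hypothetical placeholders, not any real deployment tool's API.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    zones: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Tuple[bool, List[str]]:
    """Deploy zone by zone, halting at the first zone that fails its check.

    Returns (ok, deployed_hosts). When ok is False, deployed_hosts lists the
    hosts that received the change and may need a rollback.
    """
    deployed: List[str] = []
    for zone in zones:
        for host in zone:
            deploy(host)
            deployed.append(host)
        # Check the whole zone before touching the next, larger one.
        if not all(healthy(host) for host in zone):
            return False, deployed
    return True, deployed


# Usage: a small blue zone first, then the larger green zone.
zones = [["blue-1"], ["green-1", "green-2", "green-3"]]
ok, touched = staged_rollout(zones, deploy=lambda h: None, healthy=lambda h: True)
```

If the blue zone fails its health check, the green hosts are never touched, which limits the blast radius to the small zone.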

A configuration file should not grow! That looks like a design failure here, and I want to understand how it happened.



