Security breaks during partial failures – design notes from distributed systems
7 points by sandhyavinjam 20 hours ago | 1 comment
TL;DR: Many security mechanisms fail not during attacks, but during partial outages. This post documents early design notes for a failure-aware security framework for distributed systems.

The problem

In production distributed systems, security often breaks when things are half working:

auth services degrade → retries explode

fallback paths widen access

recovery logic becomes the attack surface

Nothing is “exploited”, yet the system becomes unsafe.

Most security models assume stable components and clean failures. Real systems don’t behave that way.

Design assumptions

We assume:

correlated failures

retries are adversarial

timeouts are unsafe defaults (a check that timed out has not passed)

recovery paths matter as much as steady-state logic
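
One way to read the timeout assumption is that a check which does not answer in time must count as a deny. The sketch below illustrates that fail-closed reading only; `check_with_deadline` and `slow_auth` are hypothetical names, not part of any existing framework.

```python
import concurrent.futures
import time


def check_with_deadline(check, timeout_s):
    """Run check(); only an explicit True within the deadline allows access."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(check)
        try:
            return future.result(timeout=timeout_s) is True
        except concurrent.futures.TimeoutError:
            return False  # no answer in time: deny, do not fall back to allow
        except Exception:
            return False  # the check itself failed: deny
    finally:
        # Do not block on a hung check; the worker thread finishes on its own.
        pool.shutdown(wait=False)


def slow_auth():
    """Hypothetical stand-in for a degraded auth service."""
    time.sleep(0.2)
    return True
```

With these definitions, `check_with_deadline(slow_auth, 0.02)` denies even though the service would eventually have said yes, which is the point: the caller never turns "unknown" into "allowed".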

We don’t assume:

global consistency

perfect identity

reliable clocks

centralized enforcement

Framework ideas (high level)

This work explores four ideas:

1. Failure-aware trust

Trust degrades under failure, not just compromise

Access narrows automatically during partial outages
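
A minimal sketch of how access might narrow with dependency health. The trust levels, the `REQUIRED` mapping, and the health inputs are all assumptions made for illustration; the notes do not specify any of them.

```python
from enum import IntEnum


class Trust(IntEnum):
    NONE = 0
    READ_ONLY = 1
    FULL = 2


# Hypothetical mapping from action to the minimum trust it needs.
REQUIRED = {"read": Trust.READ_ONLY, "write": Trust.FULL, "admin": Trust.FULL}


def effective_trust(base, healthy_deps, total_deps):
    """Cap trust by dependency health; the result never exceeds `base`."""
    if total_deps <= 0 or healthy_deps <= 0:
        return Trust.NONE
    if healthy_deps < total_deps:
        return min(base, Trust.READ_ONLY)  # partial outage: narrow access
    return base


def allowed(action, base, healthy_deps, total_deps):
    required = REQUIRED.get(action)
    if required is None:
        return False  # unknown actions are denied
    return effective_trust(base, healthy_deps, total_deps) >= required
```

Here a caller with `Trust.FULL` can still read while one of three dependencies is down, but writes are refused until all three report healthy. Trust only ever moves down from `base`, so a failure cannot widen access.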

2. Security invariants at runtime

Invariants are continuously enforced

Violations trigger containment, not alerts
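
To make "containment, not alerts" concrete, here is a rough sketch in which a violated invariant blocks further actions instead of only being logged. `Guard`, `Contained`, and the example invariant are hypothetical; a real system would also need a reviewed way to release containment, which is not shown.

```python
class Contained(Exception):
    """Raised when an action is refused because the guard is in containment."""


class Guard:
    def __init__(self, invariants):
        self.invariants = invariants  # name -> predicate(state) -> bool
        self.contained = False
        self.violations = []

    def check(self, state):
        if self.contained:
            return False  # containment is sticky once entered
        for name, holds in self.invariants.items():
            try:
                ok = holds(state)
            except Exception:
                ok = False  # an invariant that cannot be evaluated counts as violated
            if not ok:
                self.violations.append(name)
                self.contained = True
        return not self.contained

    def perform(self, state, action):
        """Run action() only if every invariant holds for `state`."""
        if not self.check(state):
            raise Contained(", ".join(self.violations))
        return action()


# Hypothetical invariant: admin sessions must have MFA.
guard = Guard({"admin_requires_mfa": lambda s: s["role"] != "admin" or s["mfa"]})
```

After one admin request without MFA, `guard.perform(...)` raises `Contained` for every later call, including ones that would otherwise pass. That is a deliberately blunt choice for the sketch; the scope of containment (per session, per service, global) is a design question the notes leave open.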

3. Retry-safe security primitives

Idempotent, monotonic, and side-effect-bounded

Retries can’t escalate privilege
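
A small sketch of what such a primitive could look like, assuming each request carries a caller-supplied request ID. `ScopeGrant` is a hypothetical name. It shows idempotence and monotonicity only; bounding side effects (for example, capping the replay table) is not shown.

```python
class ScopeGrant:
    """A set of scopes that can only shrink, with replay-safe updates."""

    def __init__(self, initial_scopes):
        self._scopes = frozenset(initial_scopes)
        self._results = {}  # request_id -> scopes returned the first time

    def apply(self, request_id, requested_scopes):
        # Idempotent: a retried request returns the original result unchanged.
        if request_id in self._results:
            return self._results[request_id]
        # Monotonic: intersecting can remove scopes but never add them.
        self._scopes = self._scopes & frozenset(requested_scopes)
        self._results[request_id] = self._scopes
        return self._scopes
```

Starting from `{"read", "write"}`, a request for `{"read", "admin"}` yields `{"read"}`. Retrying the same request ID with a larger scope list returns the same `{"read"}`, and a later request cannot regain `"write"`, so no sequence of retries ends with more privilege than it started with.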

4. Security as observable state

Trust level, degradation, and containment are visible

If you can’t observe it, you can’t secure it
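
As a sketch of security as observable state, the three quantities named above could be exported as a single snapshot alongside ordinary health metrics. The field names and the JSON format are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SecurityState:
    trust_level: str                   # e.g. "full", "read_only", "none"
    degraded_dependencies: tuple = ()  # names of dependencies reporting unhealthy
    contained: bool = False            # whether containment is currently active


def export_state(state):
    """Serialize the snapshot so a metrics or status endpoint can expose it."""
    return json.dumps(asdict(state), sort_keys=True)


snapshot = SecurityState(trust_level="read_only", degraded_dependencies=("auth",))
```

Calling `export_state(snapshot)` gives a JSON object with the trust level, the degraded dependencies, and the containment flag, so an operator can see during an incident that access has narrowed and why.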

What this is not

Not zero trust marketing

Not compliance

Not a finished system

It’s an attempt to treat failure as the normal case, not an exception.

Why publish this early?

Because many real failures:

don’t fit clean research papers

happen during incidents, not attacks

are invisible outside production systems

We’re sharing design notes to get feedback before formalizing or evaluating further.

Feedback welcome

If you’ve seen security regressions during outages or retries causing unsafe behavior, I’d like to hear about it.

This is ongoing work. No claims of novelty or completeness.







