
Aydarbek Romanuly · Last Updated: February 19, 2026
Collected at: https://www.iotforall.com/durability-first-iot
Modern systems generate streams of events everywhere: devices at the edge, gateways, backend services, and cloud workloads. What often gets overlooked is that failure is the normal state, not the exception, especially outside perfectly managed cloud environments.
Disk pressure, power loss, partial network partitions, process crashes, and restarts are daily reality in IoT, edge, and hybrid systems. Yet many event pipelines assume stable infrastructure, heavy runtimes, or complex operational setups.
This article shares lessons from building a durability-first event log, designed to behave predictably under failure, with a focus on correctness, operational simplicity, and realistic constraints rather than maximum feature breadth.
The Core Problem: Failure Isn’t an Edge Case
In many real systems, especially those touching hardware or edge deployments, you can’t assume:
- stable network connectivity
- graceful shutdowns
- unlimited disk
- a dedicated ops team
- homogeneous x86 servers
Yet many popular event systems are optimized primarily for throughput and scale, with durability and recovery treated as secondary concerns or operationally expensive features.
From experience, the most painful incidents don’t come from a lack of throughput; they come from:
- unclear recovery semantics
- long restart times
- manual intervention after crashes
- partial data loss that’s hard to detect
The question that motivated this work was simple:
What would an event log look like if durability, recovery, and simplicity were the first constraints, not optional features?
Design Principles
The system described here (Ayder) follows a few strict principles:
1. Durability by Default
Writes are acknowledged only after being safely persisted and replicated (configurable, but explicit). If a process is killed mid-write, the system must recover without data loss.
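The ack rule above can be sketched in a few lines. This is a toy illustration of sync-majority acknowledgment, with function names of my own choosing, not Ayder's actual implementation:

```python
def majority(cluster_size: int) -> int:
    """Minimum number of replicas (leader included) that form a quorum."""
    return cluster_size // 2 + 1

def can_ack(cluster_size: int, persisted_confirmations: int) -> bool:
    """Acknowledge a write only once a majority has durably persisted it."""
    return persisted_confirmations >= majority(cluster_size)

# In a 3-node cluster, the leader plus one follower (2/3) is enough to ack:
assert can_ack(3, 2)
# A single persisted copy is not:
assert not can_ack(3, 1)
```

Because a write is only visible to clients after a majority holds it on disk, killing any single process mid-write cannot lose an acknowledged event.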
2. Crash Recovery Must Be Boring
A restart should not trigger rebalancing storms, operator playbooks, or manual cleanup. Recovery should be automatic and fast.
3. Operational Simplicity Matters
A single static binary, no JVM, no external coordinators, no client libraries required to get started. If you can curl, you can produce and consume events.
4. Measure the Worst Case, Not the Average
P99.999 latency and unclean shutdown behavior are more informative than peak throughput numbers.
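"Measure the worst case" becomes concrete with a tail-percentile helper. The sketch below uses the nearest-rank method; the sample data and helper name are mine, for illustration only:

```python
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value >= p percent of samples."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))  # nearest-rank method
    return ordered[max(rank - 1, 0)]

latencies = [0.4, 0.5, 0.6, 0.7, 9.0]  # one slow outlier
# The median hides the outlier; the extreme tail exposes it.
assert percentile(latencies, 50) <= 0.7
assert percentile(latencies, 99.999) == 9.0
```

A system tuned on averages would look fine on this data; one judged at P99.999 would immediately surface the 9 ms stall.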
Architecture Overview (High Level)
At its core, the system is:
- an append-only log with partitions and monotonically increasing offsets
- replicated via Raft consensus (3/5/7 node clusters)
- persisted using sealed append-only files (AOF)
- accessed through a plain HTTP API
No ZooKeeper, no KRaft controllers, no sidecars.
Clients:
- produce raw bytes via HTTP POST
- consume via offset-based pulls
- explicitly commit offsets
This explicitness is intentional. It avoids hidden magic and makes failure behavior visible.
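The produce/consume/commit cycle might look like this from a client. The endpoint paths below are illustrative guesses, not the project's documented API; the point is that each step is an explicit, inspectable HTTP request:

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # hypothetical node address

def produce_request(topic: str, payload: bytes) -> urllib.request.Request:
    """Build (but don't send) a raw-bytes POST to an illustrative produce endpoint."""
    return urllib.request.Request(
        f"{BASE}/topics/{topic}/produce", data=payload, method="POST"
    )

def consume_request(topic: str, partition: int, offset: int) -> urllib.request.Request:
    """Offset-based pull: the client states exactly where it wants to resume."""
    return urllib.request.Request(
        f"{BASE}/topics/{topic}/{partition}/records?offset={offset}"
    )

def commit_request(topic: str, partition: int, offset: int) -> urllib.request.Request:
    """Explicit offset commit -- nothing is committed implicitly on read."""
    body = json.dumps({"partition": partition, "offset": offset}).encode()
    return urllib.request.Request(
        f"{BASE}/topics/{topic}/commit", data=body, method="POST"
    )

req = produce_request("sensor-events", b'{"temp": 21.5}')
# urllib.request.urlopen(req)  # would send it to a live node
```

Because commits are a separate request, a crashed consumer restarts from its last committed offset and re-reads anything in between, making at-least-once delivery visible rather than hidden.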
Failure as a First-Class Test Case
Instead of relying on theoretical guarantees, the system ships with a Jepsen-style smoke test that can be run locally.
The test repeatedly:
- kills nodes with SIGKILL mid-write
- restarts them in random order
- introduces optional network delay and jitter
- verifies invariants
Invariants checked:
- no gaps in offsets
- no duplicates when idempotency keys are used
- per-partition ordering preserved
- committed offsets monotonic across restarts
If something breaks, the failure is reproducible. This has been more valuable than synthetic benchmarks alone.
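The invariants above are cheap to check in plain code. A sketch, with helper names of my own (the real test harness will differ):

```python
def no_offset_gaps(offsets: list[int]) -> bool:
    """Offsets within a partition must be contiguous and increasing."""
    return all(b == a + 1 for a, b in zip(offsets, offsets[1:]))

def no_duplicate_keys(events: list[dict]) -> bool:
    """With idempotency keys, each key must appear at most once."""
    keys = [e["key"] for e in events if "key" in e]
    return len(keys) == len(set(keys))

def commits_monotonic(committed: list[int]) -> bool:
    """Committed offsets must never move backwards across restarts."""
    return all(b >= a for a, b in zip(committed, committed[1:]))

assert no_offset_gaps([5, 6, 7, 8])
assert not no_offset_gaps([5, 6, 8])      # a gap after a crash fails here
assert commits_monotonic([10, 10, 12])
```

Running checks like these after every kill/restart cycle turns "did we lose data?" from a forensic question into a pass/fail one.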
Recovery Behavior in Practice
One of the most revealing tests involved a 3-node cluster with ~8 million offsets:
- A follower is killed mid-write
- Leader continues accepting writes
- Follower is restarted
- Follower replays its local AOF
- It requests missing offsets from the leader
- Leader streams the delta
- Cluster becomes fully healthy
Observed recovery time: ~40–50 seconds
No operator intervention. No manual reassignment.
This contrasts sharply with experiences where cluster restarts take hours or require human coordination.
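The replay-then-delta sequence can be modeled in a few lines. This is a toy simulation of the behavior described above, not the real replication code; offsets here are implicit list indices, while the real system tracks them explicitly per partition:

```python
def recover_follower(local_aof: list[bytes], leader_log: list[bytes]) -> list[bytes]:
    """Replay the local AOF, then fetch only the missing suffix from the leader."""
    next_needed = len(local_aof)      # first offset not yet persisted locally
    delta = leader_log[next_needed:]  # leader streams only the gap
    return local_aof + delta

leader = [b"e0", b"e1", b"e2", b"e3", b"e4"]
follower = leader[:2]                 # killed mid-write after offset 1
recovered = recover_follower(follower, leader)
assert recovered == leader            # cluster converges with no operator help
```

Recovery cost scales with the size of the gap, not the size of the log, which is why a follower behind by a slice of an 8-million-offset log can rejoin in tens of seconds.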
Performance Under Real Constraints
Performance was measured under real network conditions, not loopback, and with durability enabled.
Cloud (x86) — 3-Node Cluster
- Sync-majority writes (2/3 nodes)
- ~50K msg/s with client P99 ≈ 3.5ms
- server P99.999 ≈ 1.2ms
The long client-side tail was primarily due to network and kernel scheduling. Server-side work remained consistently sub-2ms even at extreme percentiles.
ARM64 (Snapdragon X Elite, WSL2, Battery)
Perhaps the most surprising result came from running the same system on consumer ARM hardware:
- Snapdragon X Elite laptop
- WSL2 Ubuntu
- Running on battery
- 3-node cluster on a single machine
Result:
- ~106K msg/s
- server P99.999 ≈ 0.65ms
This reinforced a few observations:
- ARM64 is more than viable for server-style workloads
- efficient C code benefits significantly from modern ARM cores
- WSL2 overhead for async I/O is lower than often assumed
It also makes local HA testing far more accessible.
Why HTTP?
HTTP is not the fastest protocol on paper, and that’s fine.
What HTTP provides:
- debuggability (curl, logs, proxies)
- no client SDK lock-in
- easier integration in constrained environments
- predictable behavior across languages
Measured results showed that HTTP parsing was not the bottleneck. The system spent more time waiting on disk sync and network replication than parsing requests.
In practice, this tradeoff improved operability far more than it hurt performance.
Where This Is Useful (and Where It Isn’t)
This approach is not ideal for every workload.
It does make sense for:
- edge → cloud pipelines
- device or gateway event ingestion
- systems where restart time matters more than raw throughput
- teams without dedicated infra operators
- environments where JVM-based stacks are heavy
It’s not intended to:
- replace existing Kafka deployments overnight
- act as a SQL database
- provide magic exactly-once semantics without client discipline
The goal is a predictable, durable core, not maximal abstraction.
What I’m Looking for Next
At this stage, the most valuable input is not feature requests, but reality checks.
I’m looking for 2–3 teams willing to:
- sanity-check this approach against their real constraints
- share how they think about durability, recovery, and ops pain
- optionally run a small pilot or failure test
This is not a sales ask, and not a request to migrate production systems. Even a 20-minute conversation about constraints would be incredibly valuable.
Closing Thoughts
Most distributed systems look elegant until something crashes at the wrong time.
Building with failure as the default constraint changes design decisions dramatically from storage layout to APIs to recovery logic. The results may not be glamorous, but they’re often far more useful in practice.
If you’re operating or building event-driven systems under imperfect conditions, I’d love to compare notes.
