
Aydarbek Romanuly · Last Updated: February 19, 2026
Collected at: https://www.iotforall.com/durability-first-iot
Modern systems generate streams of events everywhere: devices at the edge, gateways, backend services, and cloud workloads. What often gets overlooked is that failure is the normal state, not the exception, especially outside perfectly managed cloud environments.
Disk pressure, power loss, partial network partitions, process crashes, and restarts are daily reality in IoT, edge, and hybrid systems. Yet many event pipelines assume stable infrastructure, heavy runtimes, or complex operational setups.
This article shares lessons from building a durability-first event log, designed to behave predictably under failure, with a focus on correctness, operational simplicity, and realistic constraints rather than maximum feature breadth.
The Core Problem: Failure Isn’t an Edge Case
In many real systems, especially those touching hardware or edge deployments, you can’t assume:
- stable network connectivity
- graceful shutdowns
- unlimited disk
- a dedicated ops team
- homogeneous x86 servers
Yet many popular event systems are optimized primarily for throughput and scale, with durability and recovery treated as secondary concerns or operationally expensive features.
From experience, the most painful incidents don’t come from a lack of throughput; they come from:
- unclear recovery semantics
- long restart times
- manual intervention after crashes
- partial data loss that’s hard to detect
The question that motivated this work was simple:
What would an event log look like if durability, recovery, and simplicity were the first constraints, not optional features?
Design Principles
The system described here (Ayder) follows a few strict principles:
1. Durability by Default
Writes are acknowledged only after being safely persisted and replicated (configurable, but explicit). If a process is killed mid-write, the system must recover without data loss.
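The ack rule above can be sketched in a few lines. This is a toy illustration of sync-majority acknowledgment, with function names of my own choosing, not Ayder's actual implementation:

```python
def majority(cluster_size: int) -> int:
    """Minimum number of replicas (leader included) that form a quorum."""
    return cluster_size // 2 + 1

def can_ack(cluster_size: int, persisted_confirmations: int) -> bool:
    """Acknowledge a write only once a majority has durably persisted it."""
    return persisted_confirmations >= majority(cluster_size)

# In a 3-node cluster, the leader plus one follower (2/3) is enough to ack:
assert can_ack(3, 2)
# A single persisted copy is not:
assert not can_ack(3, 1)
```

Because a write is only visible to clients after a majority holds it on disk, killing any single process mid-write cannot lose an acknowledged event.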
2. Crash Recovery Must Be Boring
A restart should not trigger rebalancing storms, operator playbooks, or manual cleanup. Recovery should be automatic and fast.
3. Operational Simplicity Matters
A single static binary, no JVM, no external coordinators, no client libraries required to get started. If you can curl, you can produce and consume events.
4. Measure the Worst Case, Not the Average
P99.999 latency and unclean shutdown behavior are more informative than peak throughput numbers.
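"Measure the worst case" becomes concrete with a tail-percentile helper. The sketch below uses the nearest-rank method; the sample data and helper name are mine, for illustration only:

```python
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value >= p percent of samples."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))  # nearest-rank method
    return ordered[max(rank - 1, 0)]

latencies = [0.4, 0.5, 0.6, 0.7, 9.0]  # one slow outlier
# The median hides the outlier; the extreme tail exposes it.
assert percentile(latencies, 50) <= 0.7
assert percentile(latencies, 99.999) == 9.0
```

A system tuned on averages would look fine on this data; one judged at P99.999 would immediately surface the 9 ms stall.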
Architecture Overview (High Level)
At its core, the system is:
- an append-only log with partitions and monotonically increasing offsets
- replicated via Raft consensus (3/5/7 node clusters)
- persisted using sealed append-only files (AOF)
- accessed through a plain HTTP API
No ZooKeeper, no KRaft controllers, no sidecars.
Clients:
- produce raw bytes via HTTP POST
- consume via offset-based pulls
- explicitly commit offsets
This explicitness is intentional. It avoids hidden magic and makes failure behavior visible.
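The produce/consume/commit cycle might look like this from a client. The endpoint paths below are illustrative guesses, not the project's documented API; the point is that each step is an explicit, inspectable HTTP request:

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # hypothetical node address

def produce_request(topic: str, payload: bytes) -> urllib.request.Request:
    """Build (but don't send) a raw-bytes POST to an illustrative produce endpoint."""
    return urllib.request.Request(
        f"{BASE}/topics/{topic}/produce", data=payload, method="POST"
    )

def consume_request(topic: str, partition: int, offset: int) -> urllib.request.Request:
    """Offset-based pull: the client states exactly where it wants to resume."""
    return urllib.request.Request(
        f"{BASE}/topics/{topic}/{partition}/records?offset={offset}"
    )

def commit_request(topic: str, partition: int, offset: int) -> urllib.request.Request:
    """Explicit offset commit -- nothing is committed implicitly on read."""
    body = json.dumps({"partition": partition, "offset": offset}).encode()
    return urllib.request.Request(
        f"{BASE}/topics/{topic}/commit", data=body, method="POST"
    )

req = produce_request("sensor-events", b'{"temp": 21.5}')
# urllib.request.urlopen(req)  # would send it to a live node
```

Because commits are a separate request, a crashed consumer restarts from its last committed offset and re-reads anything in between, making at-least-once delivery visible rather than hidden.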
Failure as a First-Class Test Case
Instead of relying on theoretical guarantees, the system ships with a Jepsen-style smoke test that can be run locally.
The test repeatedly:
- kills nodes with SIGKILL mid-write
- restarts them in random order
- introduces optional network delay and jitter
- verifies invariants
Invariants checked:
- no gaps in offsets
- no duplicates when idempotency keys are used
- per-partition ordering preserved
- committed offsets monotonic across restarts
If something breaks, the failure is reproducible. This has been more valuable than synthetic benchmarks alone.
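The invariants above are cheap to check in plain code. A sketch, with helper names of my own (the real test harness will differ):

```python
def no_offset_gaps(offsets: list[int]) -> bool:
    """Offsets within a partition must be contiguous and increasing."""
    return all(b == a + 1 for a, b in zip(offsets, offsets[1:]))

def no_duplicate_keys(events: list[dict]) -> bool:
    """With idempotency keys, each key must appear at most once."""
    keys = [e["key"] for e in events if "key" in e]
    return len(keys) == len(set(keys))

def commits_monotonic(committed: list[int]) -> bool:
    """Committed offsets must never move backwards across restarts."""
    return all(b >= a for a, b in zip(committed, committed[1:]))

assert no_offset_gaps([5, 6, 7, 8])
assert not no_offset_gaps([5, 6, 8])      # a gap after a crash fails here
assert commits_monotonic([10, 10, 12])
```

Running checks like these after every kill/restart cycle turns "did we lose data?" from a forensic question into a pass/fail one.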
Recovery Behavior in Practice
One of the most revealing tests involved a 3-node cluster with ~8 million offsets:
- A follower is killed mid-write
- Leader continues accepting writes
- Follower is restarted
- Follower replays its local AOF
- It requests missing offsets from the leader
- Leader streams the delta
- Cluster becomes fully healthy
Observed recovery time: ~40–50 seconds
No operator intervention. No manual reassignment.
This contrasts sharply with experiences where cluster restarts take hours or require human coordination.
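The replay-then-delta sequence can be modeled in a few lines. This is a toy simulation of the behavior described above, not the real replication code; offsets here are implicit list indices, while the real system tracks them explicitly per partition:

```python
def recover_follower(local_aof: list[bytes], leader_log: list[bytes]) -> list[bytes]:
    """Replay the local AOF, then fetch only the missing suffix from the leader."""
    next_needed = len(local_aof)      # first offset not yet persisted locally
    delta = leader_log[next_needed:]  # leader streams only the gap
    return local_aof + delta

leader = [b"e0", b"e1", b"e2", b"e3", b"e4"]
follower = leader[:2]                 # killed mid-write after offset 1
recovered = recover_follower(follower, leader)
assert recovered == leader            # cluster converges with no operator help
```

Recovery cost scales with the size of the gap, not the size of the log, which is why a follower behind by a slice of an 8-million-offset log can rejoin in tens of seconds.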
Performance Under Real Constraints
Performance was measured under real network conditions, not loopback, and with durability enabled.
Cloud (x86) — 3-Node Cluster
- Sync-majority writes (2/3 nodes)
- ~50K msg/s with client P99 ≈ 3.5ms
- server P99.999 ≈ 1.2ms
The long client-side tail was primarily due to network and kernel scheduling. Server-side work remained consistently sub-2ms even at extreme percentiles.
ARM64 (Snapdragon X Elite, WSL2, Battery)
Perhaps the most surprising result came from running the same system on consumer ARM hardware:
- Snapdragon X Elite laptop
- WSL2 Ubuntu
- Running on battery
- 3-node cluster on a single machine
Result:
- ~106K msg/s
- server P99.999 ≈ 0.65ms
This reinforced a few observations:
- ARM64 is more than viable for server-style workloads
- efficient C code benefits significantly from modern ARM cores
- WSL2 overhead for async I/O is lower than often assumed
It also makes local HA testing far more accessible.
Why HTTP?
HTTP is not the fastest protocol on paper, and that’s fine.
What HTTP provides:
- debuggability (curl, logs, proxies)
- no client SDK lock-in
- easier integration in constrained environments
- predictable behavior across languages
Measured results showed that HTTP parsing was not the bottleneck. The system spent more time waiting on disk sync and network replication than parsing requests.
In practice, this tradeoff improved operability far more than it hurt performance.
Where This Is Useful (and Where It Isn’t)
This approach is not ideal for every workload.
It does make sense for:
- edge → cloud pipelines
- device or gateway event ingestion
- systems where restart time matters more than raw throughput
- teams without dedicated infra operators
- environments where JVM-based stacks are heavy
It’s not intended to:
- replace existing Kafka deployments overnight
- act as a SQL database
- provide magic exactly-once semantics without client discipline
The goal is a predictable, durable core, not maximal abstraction.
What I’m Looking for Next
At this stage, the most valuable input is not feature requests, but reality checks.
I’m looking for 2–3 teams willing to:
- sanity-check this approach against their real constraints
- share how they think about durability, recovery, and ops pain
- optionally run a small pilot or failure test
This is not a sales ask, and not a request to migrate production systems. Even a 20-minute conversation about constraints would be incredibly valuable.
Closing Thoughts
Most distributed systems look elegant until something crashes at the wrong time.
Building with failure as the default constraint changes design decisions dramatically from storage layout to APIs to recovery logic. The results may not be glamorous, but they’re often far more useful in practice.
If you’re operating or building event-driven systems under imperfect conditions, I’d love to compare notes.
