Storage Internals
Philosophy: The Log Is the Database
Most databases maintain two copies of every write: a write-ahead log (WAL) for durability and a primary data structure (B-tree, LSM tree, heap file) for queries. Flo doesn’t do this. The log is the primary data structure.
The Unified Append Log captures every mutation — KV puts, queue enqueues, stream appends, time-series writes — as a sequenced, typed entry. The Raft consensus log and the storage log are the same thing. There is no separate WAL.
Everything else — the KV hash table, the queue’s priority heap, the time-series columnar buffers — is a projection: a derived view rebuilt deterministically from the log. If you deleted every projection and replayed the log from the beginning, you’d get back to exactly the same state.
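The idea that a projection is just a deterministic fold over the log can be sketched in a few lines. This is a toy model, not Flo's implementation: the `Entry` class, its fields, and `build_kv` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    index: int        # monotonically increasing log index
    entry_type: str   # "kv_put" or "kv_delete" in this toy model
    key: str
    value: bytes = b""

def build_kv(log):
    """Fold the log into a KV view: the same log always yields the same state."""
    state = {}
    for e in log:
        if e.entry_type == "kv_put":
            state[e.key] = e.value
        elif e.entry_type == "kv_delete":
            state.pop(e.key, None)
    return state

log = [
    Entry(1, "kv_put", "a", b"1"),
    Entry(2, "kv_put", "b", b"2"),
    Entry(3, "kv_delete", "a"),
    Entry(4, "kv_put", "b", b"3"),
]
live = build_kv(log)
rebuilt = build_kv(log)   # "delete the projection and replay from the start"
```

Because `build_kv` reads nothing but the log, throwing the dictionary away and replaying produces an identical view.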
Unified Append Log (UAL)
The UAL is a sequence of entries, each identified by a monotonically increasing 64-bit index.
Entry Format
Every UAL entry has a 40-byte header followed by a variable-length payload:

    ┌──────────┬──────────┬───────┬───────────┬────────────┬─────────────┬───────┐
    │ CRC32C   │ entry_   │ flags │ raft_term │ raft_index │ timestamp   │ pay-  │
    │ (4B)     │ type(2B) │ (2B)  │ (8B)      │ (8B)       │ _ns (8B)    │ load  │
    └──────────┴──────────┴───────┴───────────┴────────────┴─────────────┴───────┘

The CRC covers the header and payload together. Payload layout varies by entry type: each projection knows how to parse its own payloads.
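As a rough illustration of how such an entry could be framed and verified, here is a sketch under several assumptions that the text does not confirm: little-endian byte order, the field order shown in the diagram, a CRC computed over everything after the CRC field itself, and 8 zero padding bytes to bring the six listed fields (32 bytes) up to the stated 40-byte header. The diagram shows no length field, so `decode_entry` assumes the buffer holds exactly one entry.

```python
import struct

# CRC32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78.
_TABLE = []
for i in range(256):
    c = i
    for _ in range(8):
        c = (c >> 1) ^ 0x82F63B78 if c & 1 else c >> 1
    _TABLE.append(c)

def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

# Assumed layout after the 4-byte CRC: entry_type, flags, raft_term,
# raft_index, timestamp_ns, then 8 padding bytes (an assumption).
_BODY = struct.Struct("<HHQQQ8x")
HEADER_SIZE = 4 + _BODY.size   # 40 bytes

def encode_entry(entry_type, flags, term, index, ts_ns, payload):
    body = _BODY.pack(entry_type, flags, term, index, ts_ns)
    crc = crc32c(body + payload)          # covers rest of header + payload
    return struct.pack("<I", crc) + body + payload

def decode_entry(buf):
    (crc,) = struct.unpack_from("<I", buf, 0)
    if crc32c(buf[4:]) != crc:
        raise ValueError("CRC mismatch: entry is corrupt")
    fields = _BODY.unpack_from(buf, 4)
    return fields, buf[HEADER_SIZE:]

raw = encode_entry(1, 0, 7, 42, 1_700_000_000_000_000_000, b"hello")
fields, payload = decode_entry(raw)
```

A production build would normally use a hardware-accelerated CRC32C rather than a table in pure Python; the table version is used here only to keep the sketch dependency-free.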
Three-Tier Storage
| Tier | Implementation | Purpose |
|---|---|---|
| Hot | mmap’d ring buffer in RAM (default 64 MB) | Recent entries for active reads. O(1) by index. |
| Warm | Sealed disk segments (.flseg files), memory-mapped | Historical data, bounded by local disk space. |
| Cold | S3/GCS/Azure Blob (planned) | Long-term archival, on-demand download. |
When entries age out of the hot ring, they’re flushed to warm segments. A warm store hash map keeps payload copies in memory (default 32 MB) so reads don’t have to go to disk immediately.
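The hot-to-warm handoff can be sketched as a fixed-size ring whose evicted entries land in a warm map. This is a simplified model: it counts entries rather than bytes, keeps the warm tier as a plain dictionary, and the `HotRing` class and its method names are hypothetical.

```python
class HotRing:
    """Fixed-capacity ring addressed by log index (O(1) lookup)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.first = 1   # oldest index still in the ring
        self.next = 1    # index the next append will receive

    def append(self, payload):
        """Store payload; return (index, evicted) where evicted may be None."""
        evicted = None
        if self.next - self.first == self.capacity:
            # Ring is full: the oldest entry ages out to the warm tier.
            evicted = (self.first, self.slots[self.first % self.capacity])
            self.first += 1
        index = self.next
        self.slots[index % self.capacity] = payload
        self.next += 1
        return index, evicted

    def get(self, index):
        if self.first <= index < self.next:
            return self.slots[index % self.capacity]
        return None   # aged out (or never written)

ring, warm = HotRing(capacity=2), {}
for payload in (b"a", b"b", b"c"):
    _, evicted = ring.append(payload)
    if evicted is not None:
        warm[evicted[0]] = evicted[1]

def read(index):
    hit = ring.get(index)
    return hit if hit is not None else warm.get(index)
```

The read path checks the ring first and falls back to the warm map, which mirrors the tier order in the table above.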
Segment Format
Each sealed segment is a .flseg file:

    [SegmentHeader]  magic "FLOSEG\0\0", version, segment_id, index range, entry count
    [Entry 0..N]     Sequential entries
    [SparseIndex]    Sampled every ~256 entries: index → file offset
    [SegmentFooter]  index_offset, index_count, crc32c

The sparse index enables binary search by UAL index without scanning every entry. The footer CRC covers the entire segment, so corruption is detected on read.
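A lookup through a sparse index can be sketched as a binary search for the last sample at or before the target, followed by a short forward scan. The sample values and the `locate` helper below are made up for illustration; only the sampling idea comes from the format above.

```python
import bisect

def locate(sparse_index, target):
    """sparse_index: sorted list of (ual_index, file_offset) samples.

    Returns the sample to start scanning from, or None if the target
    precedes the first entry in the segment.
    """
    keys = [ual_index for ual_index, _ in sparse_index]
    pos = bisect.bisect_right(keys, target) - 1
    if pos < 0:
        return None
    return sparse_index[pos]

# Hypothetical samples: one every 256 entries.
samples = [(1, 64), (257, 9000), (513, 18000)]
start = locate(samples, 300)
entries_to_skip = 300 - start[0]   # forward scan length from the sample
```

With a sample every ~256 entries, the forward scan after the binary search is bounded by roughly 256 entries regardless of segment size.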
Projections
Projections are specialized data structures that consume UAL entries and maintain queryable state.
KV Projection
- Hash table mapping keys to values, plus MVCC version chains (ring buffer, default 8 versions per key)
- TTL tracking for automatic expiry
- Reads are O(1) hash table lookups — no Raft round-trip needed
- Historical lookups walk the version chain
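A minimal sketch of the version-chain idea follows, assuming versions are tagged with the UAL index that wrote them (the text does not say what the version identifier is). TTL handling is omitted, and `KVProjection`, `apply_put`, and `get_at` are hypothetical names.

```python
from collections import deque

class KVProjection:
    def __init__(self, max_versions=8):
        self.max_versions = max_versions
        self.chains = {}   # key -> deque of (ual_index, value), oldest first

    def apply_put(self, ual_index, key, value):
        chain = self.chains.setdefault(key, deque(maxlen=self.max_versions))
        chain.append((ual_index, value))   # oldest version drops when full

    def get(self, key):
        """Latest value: one hash lookup, no log access."""
        chain = self.chains.get(key)
        return chain[-1][1] if chain else None

    def get_at(self, key, ual_index):
        """Newest retained version written at or before ual_index."""
        for idx, value in reversed(self.chains.get(key, ())):
            if idx <= ual_index:
                return value
        return None

kv = KVProjection()
for i in range(1, 11):
    kv.apply_put(i, "k", f"v{i}".encode())
```

After ten writes to the same key, only the eight most recent versions remain, so a historical read older than the retained window returns nothing.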
Queue Projection
The old system stored each message as 8 separate KV entries (~1.5 KB overhead per message). The new design uses native structures:
- Ready heap — min-heap ordered by priority, 16 bytes per node
- Lease tracker — maps sequences to lease expiry timestamps
- DLQ state — tracks retry counts, moves failures to dead-letter queue
Per-message overhead: ~64 bytes (23× reduction from the old design).
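How the three structures interact can be sketched as follows. This is an illustrative model only: it assumes a lower number means higher priority, that an expired lease counts as one failed attempt, and that `max_retries` is a configurable limit. None of these details are stated above, and all names are hypothetical.

```python
import heapq

class QueueProjection:
    def __init__(self, max_retries=3):
        self.ready = []      # min-heap of (priority, seq)
        self.leases = {}     # seq -> (lease expiry, priority)
        self.retries = {}    # seq -> number of expired leases
        self.dlq = []        # seqs moved to the dead-letter queue
        self.max_retries = max_retries

    def enqueue(self, seq, priority):
        heapq.heappush(self.ready, (priority, seq))

    def dequeue(self, now, lease_secs):
        if not self.ready:
            return None
        priority, seq = heapq.heappop(self.ready)
        self.leases[seq] = (now + lease_secs, priority)
        return seq

    def ack(self, seq):
        self.leases.pop(seq, None)
        self.retries.pop(seq, None)

    def expire_leases(self, now):
        for seq, (expiry, priority) in list(self.leases.items()):
            if expiry > now:
                continue
            del self.leases[seq]
            self.retries[seq] = self.retries.get(seq, 0) + 1
            if self.retries[seq] >= self.max_retries:
                self.dlq.append(seq)                      # give up
            else:
                heapq.heappush(self.ready, (priority, seq))  # redeliver

q = QueueProjection(max_retries=2)
q.enqueue(seq=1, priority=5)
q.enqueue(seq=2, priority=1)
first = q.dequeue(now=0, lease_secs=10)    # seq 2: lowest priority value
q.expire_leases(now=10)                    # lease lapsed, seq 2 requeued
second = q.dequeue(now=10, lease_secs=10)
q.expire_leases(now=20)                    # second failure, seq 2 to DLQ
third = q.dequeue(now=20, lease_secs=10)
q.ack(third)
```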
Stream Projection
Streams have no traditional projection. Stream records are UAL entries with entry_type = stream_append. Reading a stream is reading a range of UAL entries: the log index is the stream offset. Zero copy, no derived state.
Consumer group offsets are stored as KV entries (prefixed with cg:), so the KV projection handles that state.
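A stream read can therefore be sketched as a filtered range scan over the log, with the consumer's position saved back through KV. The tuple layout, the idea that each entry carries a stream name, and the exact `cg:` key format are assumptions made for this example.

```python
def read_stream(ual, stream, from_offset, limit):
    """Return up to `limit` (offset, payload) records for `stream`."""
    records = []
    for index, entry_type, name, payload in ual:
        if index < from_offset:
            continue
        if entry_type == "stream_append" and name == stream:
            records.append((index, payload))   # offset is the UAL index
            if len(records) == limit:
                break
    return records

# Toy log mixing stream and non-stream entries.
ual = [
    (1, "stream_append", "orders", b"o1"),
    (2, "kv_put", "k", b"v"),
    (3, "stream_append", "orders", b"o2"),
    (4, "stream_append", "clicks", b"c1"),
]
kv = {}
batch = read_stream(ual, "orders", from_offset=1, limit=10)
# Hypothetical key layout: cg:<group>:<stream> -> next offset to read.
kv["cg:billing:orders"] = batch[-1][0] + 1
```

Because the log is shared, offsets within one stream are increasing but not contiguous; a consumer resumes from the stored offset rather than counting records.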
Time-Series Projection
TS data has no dedicated disk files. All ts_write entries are appended to the shared UAL (same .flseg segments as KV, queue, and stream data).
- Write buffers — per-series, per-field in-memory buffers (1024 points capacity)
- Block index — in-memory metadata: min/max timestamp, point count, UAL index range
- Series metadata lives in the KV projection under _ts:meta:* keys
When a write buffer fills, flushBuffer() records a block metadata entry (timestamps + UAL index range) and discards the raw points. Reconstructing historical block data requires replaying the UAL in [ual_index_start..ual_index_end]. This means hot reads (from the write buffer) are fast, while cold reads (from flushed blocks) require UAL replay.
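The flush behaviour can be sketched like this. The `Block` fields follow the block index description above; the `WriteBuffer` class and the choice to flush exactly when the buffer reaches capacity are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Block:
    min_ts: int
    max_ts: int
    count: int
    ual_index_start: int
    ual_index_end: int

class WriteBuffer:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.points = []   # (ual_index, timestamp, value), in arrival order
        self.blocks = []   # metadata for flushed ranges

    def append(self, ual_index, ts, value):
        self.points.append((ual_index, ts, value))
        if len(self.points) >= self.capacity:
            self.flush_buffer()

    def flush_buffer(self):
        if not self.points:
            return
        timestamps = [p[1] for p in self.points]
        self.blocks.append(Block(
            min_ts=min(timestamps),
            max_ts=max(timestamps),
            count=len(self.points),
            ual_index_start=self.points[0][0],
            ual_index_end=self.points[-1][0],
        ))
        self.points = []   # raw points are dropped; the UAL still has them

buf = WriteBuffer(capacity=4)
for i in range(5):
    buf.append(ual_index=10 + i, ts=100 + i, value=float(i))
```

After five writes with a capacity of four, one block describes UAL indices 10 through 13 and a single point remains in the buffer for hot reads.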
Projection Router
The Projection Router is the single fan-out point between the UAL and the projections. It routes committed entries by type:
| Entry Types | Destination |
|---|---|
| kv_put, kv_delete, kv_batch, cg_* | KV Projection |
| queue_enqueue, queue_ack, queue_requeue | Queue Projection |
| ts_write, ts_write_batch | TS Projection |
| stream_append | Nothing (data stays in UAL) |
| raft_noop, raft_config | Nothing (consensus layer) |
An applied_index guard ensures that entries which have already been applied (index ≤ applied_index) are silently skipped if they are delivered again.
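The routing table and the guard can be sketched together. This is a simplified model: projections are plain callables, `cg_commit` is a made-up entry type standing in for the cg_* family, and the class layout is hypothetical.

```python
class ProjectionRouter:
    ROUTES = {
        "kv_put": "kv", "kv_delete": "kv", "kv_batch": "kv",
        "queue_enqueue": "queue", "queue_ack": "queue",
        "queue_requeue": "queue",
        "ts_write": "ts", "ts_write_batch": "ts",
    }

    def __init__(self, projections, applied_index=0):
        self.projections = projections      # name -> callable(index, type, payload)
        self.applied_index = applied_index

    def route(self, index, entry_type, payload):
        if index <= self.applied_index:
            return False                    # already applied: skip silently
        if entry_type.startswith("cg_"):
            target = "kv"
        else:
            # stream_append and raft_* have no projection, so target is None.
            target = self.ROUTES.get(entry_type)
        if target is not None:
            self.projections[target](index, entry_type, payload)
        self.applied_index = index
        return True

seen = []
router = ProjectionRouter({
    "kv": lambda i, t, p: seen.append(("kv", i)),
    "queue": lambda i, t, p: seen.append(("queue", i)),
    "ts": lambda i, t, p: seen.append(("ts", i)),
})
results = [
    router.route(1, "kv_put", b""),
    router.route(1, "kv_put", b""),         # duplicate delivery
    router.route(2, "stream_append", b""),
    router.route(3, "cg_commit", b""),
    router.route(4, "queue_enqueue", b""),
]
```

Advancing `applied_index` even for entries with no destination keeps the guard correct for stream and Raft entries.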
Snapshots
Projections live in RAM. Without snapshots, recovery requires replaying the entire UAL from the beginning.
Snapshot Format (.fsnap)
    ┌─────────────────────────────────┐
    │ Header (64 bytes)               │
    │   magic: "FLO_SNP\0"            │
    │   ual_index (snapshot point)    │
    │   raft_term, section_count      │
    ├─────────────────────────────────┤
    │ Section: KV (type=0x01)         │
    │ Section: Queue (type=0x02)      │
    │ Section: TS (type=0x03)         │
    │ Section: Stream (type=0x04)     │
    ├─────────────────────────────────┤
    │ Footer (16 bytes)               │
    │   crc32c, magic: "FLO_SNE"      │
    └─────────────────────────────────┘

Lifecycle
- Serialize all four projections at the current applied_index
- Write to .fsnap.tmp
- fdatasync the temp file
- Atomic rename .fsnap.tmp → .fsnap
- Update MANIFEST
- Old UAL segments with indices ≤ snapshot_index can now be compacted
The atomic rename guarantees crash safety — if the process crashes before the rename, the previous snapshot remains valid.
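The write, sync, rename sequence can be sketched with standard library calls. This covers only the file-level steps; serialization, the MANIFEST update, and compaction are left out, and `write_atomically` is a hypothetical helper name.

```python
import os
import tempfile

def write_atomically(path, data):
    """Write data so that `path` holds either the old or the new content."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        # os.fdatasync is not available on every platform; fall back to fsync.
        getattr(os, "fdatasync", os.fsync)(f.fileno())
    os.replace(tmp_path, path)   # atomic rename on POSIX filesystems

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "0000001000-1709234567.fsnap")
    write_atomically(target, b"snapshot v1")
    write_atomically(target, b"snapshot v2")
    with open(target, "rb") as f:
        final_bytes = f.read()
    leftovers = [name for name in os.listdir(d) if name.endswith(".tmp")]
```

A crash before `os.replace` leaves only a stray .tmp file behind, which is why the previous snapshot stays usable.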
Crash Safety
| Scenario | What You Lose | Why It’s OK |
|---|---|---|
| Crash during UAL append | The uncommitted entry | Wasn’t acknowledged to client |
| Crash during snapshot write | The in-progress snapshot | Previous snapshot still valid |
| Crash after snapshot, before compaction | Nothing | Snapshot valid, extra UAL entries harmless |
| Power loss (no fsync) | At most ~1ms of entries | Bounded by group commit interval |
Recovery
When a node restarts, each partition recovers:
- Load snapshot: read MANIFEST, validate CRC, deserialize all projections
- Open UAL: discover warm segments on disk
- Replay: feed UAL entries from snapshot_index + 1 through ProjectionRouter (same code path as live operation)
- Load cold manifest: metadata only, no data fetched
- Ready for traffic
If no snapshot exists, recovery replays the entire UAL from the first segment … slow, but correct.
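The equivalence between the two recovery paths can be sketched with a toy KV state. The `recover` function, the tuple shapes, and the assumption that the first UAL index is 1 are illustrative only.

```python
def apply_kv(state, entry):
    op, key, value = entry
    if op == "kv_put":
        state[key] = value
    elif op == "kv_delete":
        state.pop(key, None)

def recover(snapshot, ual, apply):
    """snapshot: (snapshot_index, state) or None.
    ual: iterable of (index, entry) in index order."""
    if snapshot is None:
        applied_index, state = 0, {}          # full replay
    else:
        applied_index, state = snapshot[0], dict(snapshot[1])
    for index, entry in ual:
        if index <= applied_index:
            continue                          # already covered by the snapshot
        apply(state, entry)
        applied_index = index
    return applied_index, state

ual = [
    (1, ("kv_put", "a", b"1")),
    (2, ("kv_put", "b", b"2")),
    (3, ("kv_delete", "a", None)),
    (4, ("kv_put", "c", b"3")),
]
snapshot_at_2 = (2, {"a": b"1", "b": b"2"})
from_snapshot = recover(snapshot_at_2, ual, apply_kv)
from_scratch = recover(None, ual, apply_kv)
```

Both calls end at the same index with the same state; the snapshot only shortens the replay.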
Memory Controller
Each shard gets a fixed memory budget. Default split for a 2 GB shard:
| Component | Share | Default Budget |
|---|---|---|
| UAL Hot Ring | 12.5% | 256 MB |
| KV Projection | 37.5% | 768 MB |
| Queue Projection | 6.25% | 128 MB |
| TS Projection | 12.5% | 256 MB |
| I/O Buffers | 6.25% | 128 MB |
| Snapshot Buffer | 3.125% | 64 MB |
| Warm Store | 6.25% | 128 MB |
| Reserve | 15.625% | 320 MB |
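The table is plain arithmetic on the shard size, which a short sketch makes easy to check. The dictionary keys and the `budgets_mb` helper are hypothetical; the shares are taken directly from the table.

```python
from fractions import Fraction

# Shares from the table, written as exact fractions of the shard budget.
SHARES = {
    "ual_hot_ring": Fraction(1, 8),       # 12.5%
    "kv_projection": Fraction(3, 8),      # 37.5%
    "queue_projection": Fraction(1, 16),  # 6.25%
    "ts_projection": Fraction(1, 8),      # 12.5%
    "io_buffers": Fraction(1, 16),        # 6.25%
    "snapshot_buffer": Fraction(1, 32),   # 3.125%
    "warm_store": Fraction(1, 16),        # 6.25%
    "reserve": Fraction(5, 32),           # 15.625%
}

def budgets_mb(shard_mb):
    return {name: int(share * shard_mb) for name, share in SHARES.items()}

two_gb = budgets_mb(2048)
half_gb = budgets_mb(512)
```

The shares sum to exactly 1, so the budgets for a 2048 MB shard add up to 2048 MB. Applying the same shares to a 512 MB shard gives a 64 MB hot ring and a 32 MB warm store, which matches the defaults quoted in the Three-Tier Storage section.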
Backpressure Levels
- Eviction: ask the component to free memory (drop old MVCC versions, spill to disk)
- Reserve borrow: borrow from the reserve pool, tracked and repaid
- Client backpressure: return ShardMemoryPressure as a retriable error
- Hard reject: write rejected immediately to prevent OOM
Directory Layout

    {data_dir}/
    ├── SYSTEM                              # Topology lock (shards, partitions, version)
    ├── 00000/                              # Shard 0 (zero-padded 5 digits)
    │   ├── MANIFEST                        # Latest snapshot ref + cold segment index
    │   ├── segs/
    │   │   ├── 0000000001.flseg            # UAL segment (10-digit zero-padded first index)
    │   │   ├── 0000000257.flseg
    │   │   └── *.flseg.tmp                 # Transient (in-flight writes only)
    │   ├── snaps/
    │   │   ├── 0000001000-1709234567.fsnap # Snapshot at UAL index 1000
    │   │   └── *.fsnap.tmp                 # Transient (in-flight writes only)
    │   └── cold.fcold                      # (optional) Cold tier manifest
    ├── 00001/
    │   ├── MANIFEST
    │   ├── segs/
    │   └── snaps/
    └── ...{shard_count - 1}/

SYSTEM
Written once on first boot. If shards or partitions don’t match on restart, the node refuses to start with TopologyMismatch. Format:

    {
      "shards": 8,
      "partitions": 256,
      "created_at": 1772033477,
      "version": "1.0.0"
    }

MANIFEST
One per shard: a JSON file tracking the latest snapshot pointer and cold segment inventory. No per-directory manifests to coordinate.
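The naming scheme in the directory layout can be sketched as a few path helpers. The zero-padding widths come from the layout comments; treating the second number in a snapshot name as a Unix timestamp is an assumption, and the helper names are hypothetical.

```python
import posixpath

def shard_dir(data_dir, shard):
    return posixpath.join(data_dir, f"{shard:05d}")

def segment_path(data_dir, shard, first_index):
    # Segment files are named after the first UAL index they contain.
    return posixpath.join(shard_dir(data_dir, shard), "segs",
                          f"{first_index:010d}.flseg")

def snapshot_path(data_dir, shard, ual_index, unix_ts):
    # Assumed: <UAL index>-<Unix timestamp>.fsnap
    return posixpath.join(shard_dir(data_dir, shard), "snaps",
                          f"{ual_index:010d}-{unix_ts}.fsnap")

seg = segment_path("/data", 0, 257)
snap = snapshot_path("/data", 0, 1000, 1709234567)
```

Zero-padded names sort lexicographically in index order, so a plain directory listing is enough to discover segments in sequence.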
Atomic Writes
Both segments and snapshots are written atomically via .tmp → fdatasync → rename. If the process crashes before the rename, the previous file remains valid.