Clustering
Overview
Flo clusters consist of multiple nodes that coordinate via:
- SWIM gossip — failure detection, membership protocol, metadata dissemination
- Raft consensus — per-partition replication with leader election
- Controller Raft — runs on Shard 0, manages the global partition table
- Partition Table — maps each partition to its owning node(s)
Cluster Formation
Seed Nodes
Each node needs at least one seed address to join the cluster:

    [cluster]
    seeds = ["10.0.1.10:9000", "10.0.1.11:9000", "10.0.1.12:9000"]
    node_id = 1

Or via environment variables:

    FLO_CLUSTER_SEEDS=10.0.1.10:9000,10.0.1.11:9000
    FLO_NODE_ID=1

Bootstrap
The first node in a cluster bootstraps automatically when no seeds respond. Subsequent nodes join by contacting any existing member.

    # First node (bootstraps)
    flo server start --node-id 1 --port 9000

    # Second node (joins via seed)
    flo server start --node-id 2 --port 9000 --seeds 10.0.1.10:9000

    # Third node
    flo server start --node-id 3 --port 9000 --seeds 10.0.1.10:9000,10.0.1.11:9000

Partitioning
Data is distributed across nodes using hash-based partitioning:

    hash(key) → partition_id (0..N-1, where N is the partition count)
    partition_table[partition_id] → (leader_node, replica_nodes)

The default partition count is 64. Each partition is a Raft group with a configurable replication factor (default: 3).
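The two-step lookup above can be sketched in a few lines. This is a minimal illustration, not Flo's implementation: the text does not name the hash function, so SHA-256 and the modulo reduction are assumptions, and `partition_for`, `route`, and the sample `partition_table` contents are hypothetical names and data.

```python
import hashlib

PARTITION_COUNT = 64  # default partition count stated in the text


def partition_for(key: bytes, partition_count: int = PARTITION_COUNT) -> int:
    """Map a key to a partition id in the range 0..partition_count-1."""
    # Assumption: SHA-256 stands in for whatever hash Flo really uses.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# Hypothetical partition table for a 3-node cluster:
# partition id -> (leader node, replica nodes).
partition_table = {
    pid: (pid % 3 + 1, [(pid + 1) % 3 + 1, (pid + 2) % 3 + 1])
    for pid in range(PARTITION_COUNT)
}


def route(key: bytes):
    """Return (partition id, leader node, replica nodes) for a key."""
    pid = partition_for(key)
    leader, replicas = partition_table[pid]
    return pid, leader, replicas


if __name__ == "__main__":
    print(route(b"user:42"))
```

Because the mapping depends only on the key and the partition count, every node computes the same partition for the same key; only the partition table has to be kept in sync across the cluster.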
Partition Assignment
The Controller Raft on Shard 0 manages partition assignments:
| Event | Action |
|---|---|
| Node joins | Rebalance partitions to include new node |
| Node leaves (graceful) | Migrate partitions before shutdown |
| Node fails (detected by SWIM) | Raft elects new leader from remaining replicas |
| Manual rebalance | Trigger via REST API or CLI |
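To make the "node joins" row concrete, here is a small sketch of how a partition table could be recomputed when membership changes. The round-robin placement is an assumption chosen for brevity; the text does not describe the controller's actual rebalancing strategy, and `assign_partitions` and `leaders_per_node` are hypothetical helpers.

```python
def assign_partitions(nodes, partition_count=64, replication_factor=3):
    """Round-robin placement: each partition gets replication_factor consecutive nodes."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds node count")
    ordered = sorted(nodes)
    table = {}
    for pid in range(partition_count):
        owners = [ordered[(pid + i) % len(ordered)] for i in range(replication_factor)]
        table[pid] = (owners[0], owners[1:])  # (leader, replicas)
    return table


def leaders_per_node(table):
    """Count how many partitions each node leads."""
    counts = {}
    for leader, _replicas in table.values():
        counts[leader] = counts.get(leader, 0) + 1
    return counts


if __name__ == "__main__":
    before = assign_partitions([1, 2, 3])
    after = assign_partitions([1, 2, 3, 4])  # node 4 joins
    print(leaders_per_node(before))
    print(leaders_per_node(after))
```

A naive recomputation like this moves more partitions than necessary when a node joins; a production rebalancer would normally try to minimize data movement.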
Replication
Each partition is a Raft group. The leader handles all writes; followers replicate:

    Client → Leader Node → Raft propose → Replicate to quorum (N/2+1 nodes) → Commit → Apply to UAL → Update Projections → Respond to client

Replication Factor

    [cluster]
    replication_factor = 3  # 1 leader + 2 followers per partition

With replication_factor = 3, the cluster tolerates 1 node failure per partition without data loss. With replication_factor = 5, it tolerates 2.
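The tolerance figures follow directly from the majority quorum rule. The sketch below works through the arithmetic; the function names are illustrative and not part of Flo.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of a Raft group with this many replicas."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be lost while a majority remains available."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    for rf in (1, 3, 4, 5):
        print(f"rf={rf} quorum={quorum(rf)} tolerates={tolerated_failures(rf)}")
```

Note that an even replication factor adds a replica without adding tolerance: 4 replicas need a quorum of 3 and still survive only 1 failure, which is why odd values are the usual choice.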
Failure Handling
SWIM Protocol
SWIM (Scalable Weakly-consistent Infection-style Process Group Membership) detects a failed node within an expected constant number of protocol periods and spreads the resulting membership update to all N nodes in O(log N) periods:
- Each node periodically pings a random peer
- If no response, requests indirect pings through other members
- If still no response, marks the node as suspect
- After a configurable timeout, marks as failed
Parameters:
| Setting | Default | Description |
|---|---|---|
| gossip.interval_ms | 1000 | Ping interval |
| gossip.suspect_timeout_ms | 5000 | Time before suspect → failed |
| gossip.indirect_checks | 3 | Number of indirect ping relays |
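The probe sequence and the parameters above fit together roughly as follows. This is a simplified model, not Flo's gossip code: `reachable` is a hypothetical callback standing in for the network, and the sketch assumes the local node can always reach its relays.

```python
import random
from enum import Enum


class State(Enum):
    ALIVE = "alive"
    SUSPECT = "suspect"
    FAILED = "failed"


def probe(target, peers, reachable, indirect_checks=3, rng=random):
    """One probe round: a direct ping, then indirect pings through relays.

    reachable(src, dst) models the network; src is None for the local node.
    peers must not contain the target itself.
    """
    if reachable(None, target):
        return State.ALIVE
    # Direct ping failed: ask up to indirect_checks random peers to ping for us.
    relays = rng.sample(peers, min(indirect_checks, len(peers)))
    if any(reachable(relay, target) for relay in relays):
        return State.ALIVE
    return State.SUSPECT


def resolve(state, suspected_at_ms, now_ms, suspect_timeout_ms=5000):
    """Promote a suspect to failed once the suspect timeout has elapsed."""
    if state is State.SUSPECT and now_ms - suspected_at_ms >= suspect_timeout_ms:
        return State.FAILED
    return state


if __name__ == "__main__":
    state = probe("node-4", ["node-2", "node-3"], lambda src, dst: False)
    print(state, resolve(state, suspected_at_ms=0, now_ms=5000))
```

The indirect step is what keeps a single lossy link between two nodes from being reported as a node failure.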
Leader Election
When a Raft leader fails:
- SWIM detects the failure
- Followers notice missing heartbeats
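- A follower whose election timeout expires starts a pre-vote; it can only win if its log is at least as up to date as those of a majority
- If the pre-vote succeeds, that follower starts an election
- The new leader is elected and begins accepting writes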
Typical failover time: 2-5 seconds (SWIM detection + Raft election).
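The log comparison behind the pre-vote step can be sketched with the standard Raft rule: compare the term of the last log entry first, then the index. The code below is illustrative only; it omits the other usual pre-vote condition (a voter refuses while it still hears from a live leader), and `LogPosition` and `pre_vote_succeeds` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogPosition:
    last_term: int   # term of the last log entry
    last_index: int  # index of the last log entry


def at_least_as_up_to_date(candidate: LogPosition, voter: LogPosition) -> bool:
    """Raft's rule: a higher last term wins; equal terms compare by index."""
    return (candidate.last_term, candidate.last_index) >= (voter.last_term, voter.last_index)


def pre_vote_succeeds(candidate: LogPosition, peers: list, cluster_size: int) -> bool:
    """The candidate counts itself and needs grants from a majority of the group."""
    grants = 1 + sum(at_least_as_up_to_date(candidate, peer) for peer in peers)
    return grants >= cluster_size // 2 + 1


if __name__ == "__main__":
    candidate = LogPosition(last_term=3, last_index=10)
    peers = [LogPosition(3, 12), LogPosition(2, 50)]
    print(pre_vote_succeeds(candidate, peers, cluster_size=3))
```

A longer log does not win on its own: the peer at term 2, index 50 is behind a candidate at term 3, index 10, because the term is compared first.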
Monitoring
Cluster Status

    flo cluster status

Prometheus Metrics
Key cluster metrics, exposed on port 9001:
| Metric | Description |
|---|---|
| flo_cluster_nodes | Number of nodes in the cluster |
| flo_cluster_partitions | Total partition count |
| flo_raft_leader_elections_total | Leader election count |
| flo_raft_replication_lag | Replication lag in entries |
| flo_gossip_members | Active SWIM members |
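As a rough illustration of consuming these metrics, the sketch below parses a Prometheus text-format payload and runs one simple check. The sample values are invented, the samples are assumed to carry no labels or timestamps, and the membership comparison is an example check rather than an official health rule.

```python
# Hypothetical scrape output; metric names come from the table above.
SAMPLE = """\
# HELP flo_cluster_nodes Number of nodes in the cluster
# TYPE flo_cluster_nodes gauge
flo_cluster_nodes 3
flo_cluster_partitions 64
flo_raft_leader_elections_total 2
flo_raft_replication_lag 0
flo_gossip_members 3
"""


def parse_metrics(text: str) -> dict:
    """Parse unlabeled 'name value' samples, skipping comments and blank lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


def membership_consistent(metrics: dict) -> bool:
    """Example check: the controller and SWIM agree on how many nodes exist."""
    return metrics["flo_cluster_nodes"] == metrics["flo_gossip_members"]


if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE)
    print(parsed["flo_cluster_partitions"], membership_consistent(parsed))
```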
Recommended Topologies
Development
1 node, shard_count = 1, no replication.
Small Production
3 nodes, replication_factor = 3. Tolerates 1 node failure.
Large Production
5+ nodes, replication_factor = 3, 128+ partitions. Allows rolling upgrades while every partition keeps quorum.

    [cluster]
    seeds = ["10.0.1.10:9000", "10.0.1.11:9000", "10.0.1.12:9000"]
    replication_factor = 3
    partition_count = 128

    [node]
    shard_count = 8  # Match CPU core count
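One way to compare these topologies is the average number of partition replicas each node hosts. The sketch below is back-of-the-envelope arithmetic, assuming replicas of a partition always land on distinct nodes and that the development and small production setups use the default of 64 partitions; `average_replicas_per_node` is a hypothetical helper.

```python
def average_replicas_per_node(nodes: int, partition_count: int, replication_factor: int) -> float:
    """Average number of partition replicas hosted by each node."""
    if replication_factor > nodes:
        # Assumption: each replica of a partition needs its own node.
        raise ValueError("replication factor exceeds node count")
    return partition_count * replication_factor / nodes


# (nodes, partition_count, replication_factor) for each topology above.
TOPOLOGIES = {
    "development": (1, 64, 1),
    "small production": (3, 64, 3),
    "large production": (5, 128, 3),
}

if __name__ == "__main__":
    for name, (nodes, partitions, rf) in TOPOLOGIES.items():
        print(name, average_replicas_per_node(nodes, partitions, rf))
```

With 3 nodes and a replication factor of 3, every node holds a replica of every partition, so adding partitions does not spread load; spreading only starts once the node count exceeds the replication factor.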