Skip to main content

Large Cluster (10-1000+ Nodes)

For large-scale deployments, LynxDB supports role splitting. Each node runs one specific role, allowing independent scaling of metadata management, write throughput, and query concurrency.

Current implementation status

The repo includes the core cluster building blocks for this topology, but the operational surface for large separated-role clusters is still broader than the small-cluster path. Treat this guide as advanced deployment guidance and verify failover, rebalancing, and shard-routing behavior in staging before production use.

Architecture

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Meta Nodes │ │ Ingest Nodes │ │ Query Nodes │
│ (3-5, Raft) │ │ (N, stateless│ │ (M, stateless│
│ │ │ + batcher) │ │ + cache) │
│ - Shard map │ │ - Batcher │ │ - Scatter- │
│ - Node reg │ │ - Memtable │ │ gather │
│ - Leases │ │ - Flush→S3 │ │ - Partial │
│ │ │ - Replicate │ │ merge │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└───────────────────┼───────────────────┘
┌─────┴─────┐
│ S3/MinIO │
│ (source │
│ of truth)│
└───────────┘

Roles

RoleScales withCountResponsibility
MetaCluster size3-5 (fixed)Raft consensus, shard map, node registry, leader leases, field catalog, source registry
IngestWrite throughputN nodesAsync batcher, direct-to-part writes, S3 upload, batcher replication
QueryQuery concurrencyM nodesScatter-gather, partial aggregation merge, result caching

Meta Node Configuration

Meta nodes run the Raft consensus protocol. You need 3 for production (tolerates 1 failure) or 5 (tolerates 2 failures). Meta nodes are lightweight and do not store log data.

# /etc/lynxdb/config.yaml (meta-1, meta-2, meta-3)
listen: "0.0.0.0:3100"
data_dir: "/var/lib/lynxdb"
log_level: "info"

cluster:
node_id: "meta-1" # Unique per node
roles: [meta]
seeds:
- "meta-1.example.com:9400"
- "meta-2.example.com:9400"
- "meta-3.example.com:9400"

Ingest Node Configuration

Ingest nodes receive data, buffer events in the async batcher, write parts directly to immutable .lsg files, upload them to S3, and replicate batches to ISR peers via gRPC streaming. They are stateless after flush -- if an ingest node dies, its shards are reassigned and replicated batches are available on ISR peers.

# /etc/lynxdb/config.yaml (ingest nodes)
listen: "0.0.0.0:3100"
data_dir: "/var/lib/lynxdb"
log_level: "info"

cluster:
enabled: true
node_id: "ingest-1" # Unique per node
roles: [ingest]
seeds:
- "meta-1.example.com:9400"
- "meta-2.example.com:9400"
- "meta-3.example.com:9400"
virtual_partition_count: 1024
time_bucket_size: "24h"
replication_factor: 3
ack_level: "one"
heartbeat_interval: "5s"
lease_duration: "10s"

storage:
s3_bucket: "my-lynxdb-logs"
s3_region: "us-east-1"
compression: "lz4"
flush_threshold: "512mb"
compaction_workers: 4

ingest:
max_body_size: "50mb"

Query Node Configuration

Query nodes execute distributed queries using scatter-gather. They pull segments from S3 (with a local cache) and merge partial aggregation results from shards.

# /etc/lynxdb/config.yaml (query nodes)
listen: "0.0.0.0:3100"
data_dir: "/var/lib/lynxdb"
log_level: "info"

cluster:
node_id: "query-1" # Unique per node
roles: [query]
seeds:
- "meta-1.example.com:9400"
- "meta-2.example.com:9400"
- "meta-3.example.com:9400"

storage:
s3_bucket: "my-lynxdb-logs"
s3_region: "us-east-1"
segment_cache_size: "50gb"
cache_max_bytes: "8gb"

query:
max_concurrent: 50
max_query_runtime: "30m"

Distributed Query Execution

The optimizer automatically splits queries for distributed execution:

"source=nginx | where status>=500 | stats count by uri | sort -count | head 10"

Shard-level (pushed): | where status>=500 | stats count by uri (partial agg)
Coordinator (merged): | sort -count | head 10

Operators that push down to shards: scan, filter, eval, partial stats, TopK. Operators that run on the coordinator: sort, join, global dedup, streamstats.

Network Architecture

                    ┌─────────────────────────────────────┐
│ Load Balancer │
│ (ingest LB) (query LB) │
└────────┬──────────────┬──────────────┘
│ │
┌──────────────┤ ├──────────────┐
│ │ │ │
┌─────┴────┐ ┌─────┴────┐ ┌─────┴────┐ ┌─────┴────┐
│Ingest-1 │ │Ingest-2 │ │ Query-1 │ │ Query-2 │
│Ingest-3 │ │Ingest-4 │ │ Query-3 │ │ Query-4 │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│ │ │ │
└──────────────┼──────────────┼──────────────┘
│ │
┌───────┴───┐ ┌───────┴───┐
│ Meta-1 │ │ Meta-2 │
│ Meta-3 │ │ │
└───────────┘ └───────────┘

Required Ports

PortProtocolDirectionPurpose
3100TCPInboundHTTP API (ingest, query, health)
9400TCPBetween nodesCluster communication (Raft, shard rebalancing, replication)

Load Balancer Setup

Use separate load balancers (or URL-based routing) for ingest and query traffic:

  • Ingest LB routes POST /api/v1/ingest* to ingest nodes
  • Query LB routes all other traffic to query nodes
  • Health check: GET /health on port 3100

Scaling Guidelines

Ingest Nodes

Scale ingest nodes based on write throughput:

Events/secRecommended Ingest Nodes
100K2-3
500K5-10
1M10-20
5M+50+

Query Nodes

Scale query nodes based on concurrent query load:

Concurrent QueriesRecommended Query Nodes
102-3
505-10
20020-30
500+50+

Meta Nodes

Meta nodes are lightweight. 3 nodes handle clusters up to ~500 nodes. Use 5 for larger clusters.

Node Failure Recovery

ScenarioImpactRecovery Time
Ingest node failureShards reassigned to surviving ingest nodes~25 seconds
Query node failureLoad balancer routes to surviving query nodesImmediate (health check)
Meta node failure (1 of 3)Raft quorum maintained, cluster operates normallyAutomatic
Meta node failure (2 of 3)Raft quorum lost, no new shard assignmentsBring 1 meta node back
Meta quorum loss (all)Ingest continues in degraded mode for meta_loss_timeout (30s)Restore meta quorum

S3 is the source of truth. No data is lost when ingest nodes fail because batches are replicated to ISR peers.

Example: 50-Node Cluster

Meta:   3 nodes (m5.large)
Ingest: 20 nodes (c5.2xlarge, 8 vCPU, 16GB RAM)
Query: 27 nodes (r5.2xlarge, 8 vCPU, 64GB RAM)
S3: Single bucket, us-east-1

Estimated capacity:

  • Ingest: ~2M events/sec
  • Query: ~100 concurrent queries
  • Storage: Unlimited (S3)
  • Hot cache: ~1.3TB across query nodes

Next Steps