Monitoring LynxDB
LynxDB exposes comprehensive health and performance metrics through its REST API and CLI commands. This guide covers what to monitor and how to access metrics.
Health Check
The /health endpoint is designed for load balancers and container orchestrators:
curl http://localhost:3100/health
# {"status": "ok"}
Returns 200 OK when the server is ready to accept requests. Use this for:
- Load balancer health checks
- Kubernetes liveness/readiness probes
- Uptime monitoring
# CLI health check
lynxdb health
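For Kubernetes, the same endpoint can back both probes. A minimal sketch of a pod spec excerpt (the port and timing values are illustrative assumptions, not recommended defaults):
# Pod spec excerpt; adjust port and timings to your deployment
livenessProbe:
  httpGet:
    path: /health
    port: 3100
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 3100
  periodSeconds: 5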
Server Status
The GET /api/v1/stats endpoint returns detailed server metrics:
curl -s http://localhost:3100/api/v1/stats | jq .
CLI shorthand:
lynxdb status
# Output:
# LynxDB v0.1.0 -- uptime 2d 5h 30m -- healthy
#
# Storage: 1.2 GB
# Events: 3,456,789 total, 123,456 today
# Segments: 42   Memtable: 8,200 events
# Sources: nginx (45%), api-gateway (30%), postgres (25%)
# Oldest: 2025-01-08T10:30:00Z
# Indexes: 3
For machine-readable output:
lynxdb status --format json
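The JSON form is convenient for scripting. For example, a sketch that extracts a couple of values with jq (the field names .events.total and .storage.bytes are assumptions; inspect the actual response to confirm the schema):
# Field names are illustrative; check the real payload first
lynxdb status --format json | jq '{total_events: .events.total, storage_bytes: .storage.bytes}'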
Live Dashboard
The lynxdb top command provides a full-screen live TUI dashboard:
lynxdb top
lynxdb top --interval 5s
Shows four panels:
- Ingest: Events/sec, events today, total events
- Queries: Active queries, cache hit rate, materialized view count, tail sessions
- Storage: Total size, part count, batcher-buffered events, index count
- Sources: Bar chart of events by source
Press q or Ctrl+C to exit.
Key Metrics
Ingest Metrics
| Metric | Description | What to Watch |
|---|---|---|
| Events ingested/sec | Current ingest rate | Sudden drops indicate pipeline issues |
| Events today | Events ingested since midnight | Compare to baseline |
| Total events | All events in storage | Growth rate |
| Buffered events | Events waiting in the ingest batcher | Sustained growth means flush or compaction is falling behind |
Storage Metrics
| Metric | Description | What to Watch |
|---|---|---|
| Total storage size | Disk usage for all segments | Capacity planning |
| Segment count | Number of .lsg segments | High L0 count = compaction backlog |
| Buffered events | In-memory batcher depth | Should return toward zero after flush cycles |
| Compaction backlog | Pending compaction work | Growing backlog = increase workers |
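As a worked example, a cron-friendly sketch that warns when the batcher depth stays high (the .storage.buffered_events path and the threshold are assumptions for illustration):
#!/usr/bin/env bash
# Warn if the ingest batcher depth exceeds a threshold.
# The jq path and threshold below are assumptions; adjust to the real schema.
THRESHOLD=100000
buffered=$(curl -s http://localhost:3100/api/v1/stats | jq '.storage.buffered_events')
if [ "$buffered" -gt "$THRESHOLD" ]; then
  echo "WARN: $buffered events buffered -- flush or compaction may be falling behind" >&2
fi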
Query Metrics
| Metric | Description | What to Watch |
|---|---|---|
| Active queries | Currently executing queries | Near max_concurrent = bottleneck |
| Cache hit rate | Percentage of queries served from cache | Low rate = increase cache_max_bytes |
| Query latency (p50, p99) | Query execution time | Spikes indicate performance issues |
| Bloom filter skip rate | Segments skipped by bloom filters | Higher is better |
Tiering Metrics (S3)
| Metric | Description | What to Watch |
|---|---|---|
| Segments in hot tier | Segments on local SSD | Capacity planning |
| Segments in warm tier | Segments in S3 | Storage costs |
| Segment cache hit rate | Local cache for warm segments | Low rate = increase cache |
| Upload/download bytes | S3 transfer volume | Cost monitoring |
Cluster Metrics
In cluster mode, additional metrics are available for monitoring distributed operations.
Shard Metrics
| Metric | Description | What to Watch |
|---|---|---|
| shard_active | Shards in active state | Should match expected partition count |
| shard_draining | Shards being drained | Non-zero during rebalance |
| shard_migrating | Shards being migrated | Non-zero during rebalance or split |
| shard_splitting | Shards being split | Non-zero during hot partition split |
| shard_map_epoch | Current shard map version | Monotonically increasing |
Rebalance Metrics
| Metric | Description | What to Watch |
|---|---|---|
| rebalance_total | Total rebalances applied | Increasing during topology changes |
| rebalance_move_total | Total shard moves across all rebalances | Large numbers indicate frequent topology changes |
| rebalance_duration_ns | Duration of the last rebalance | Growing duration may indicate a large cluster |
Node Health Metrics
| Metric | Description | What to Watch |
|---|---|---|
| nodes_alive | Nodes sending heartbeats normally | Should match cluster size |
| nodes_suspect | Nodes with missed heartbeats | Non-zero may indicate network issues |
| nodes_dead | Nodes declared dead | Should be 0 in a healthy cluster |
| leader_changes_total | Raft leader transitions | Frequent changes indicate instability |
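A quick external probe for node health might look like this sketch (the .cluster.nodes_dead path is an assumption; the metric name matches the table above):
# Exit non-zero if any node is declared dead; the jq path is an assumption
dead=$(curl -s http://localhost:3100/api/v1/stats | jq '.cluster.nodes_dead')
[ "$dead" -eq 0 ] || { echo "CRITICAL: $dead dead node(s)"; exit 1; }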
Split Metrics
| Metric | Description | What to Watch |
|---|---|---|
| split_total | Total partition splits proposed | Increasing under hot-spot load |
| split_duration_ns | Duration of the last split | |
Meta Loss Metrics
| Metric | Description | What to Watch |
|---|---|---|
| meta_loss_duration_ns | Duration of the current or last meta-loss episode | Should be 0 normally |
| meta_loss_duplicate_parts | Duplicate partitions detected during meta loss | Non-zero indicates potential conflicts |
Cache Statistics
lynxdb cache stats
# Output:
# Query Cache
# Hits: 12,456
# Misses: 3,789
# Hit Rate: 76.7%
# Entries: 1,234
# Size: 456 MB / 1.0 GB
# Evictions: 89
# Machine-readable
lynxdb cache stats --format json
If the hit rate is consistently below 50%, consider increasing storage.cache_max_bytes or storage.cache_ttl.
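For reference, those settings live in the server config and might look roughly like this (the values are examples, not defaults):
# config.yaml excerpt; values are illustrative
storage:
  cache_max_bytes: 2147483648   # 2 GB
  cache_ttl: 10m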
Diagnostics
The lynxdb doctor command runs a comprehensive health check:
lynxdb doctor
# Output:
# ok Binary v0.1.0 (linux/amd64, go1.25.4)
# ok Config /home/user/.config/lynxdb/config.yaml (valid)
# ok Data dir /var/lib/lynxdb (42 GB free)
# ok Server localhost:3100 (healthy, uptime 2d 5h)
# ok Events 3.4M total
# ok Storage 1.2 GB
# ok Retention 7d
# ok Completion zsh detected
#
# All checks passed.
# Machine-readable
lynxdb doctor --format json
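The JSON output makes doctor usable as a CI or deployment gate. A sketch, assuming the payload carries a per-check status field (the .checks[].status shape is an assumption):
# Fail the step if any check is not ok; the JSON shape is an assumption
lynxdb doctor --format json | jq -e 'all(.checks[]; .status == "ok")' > /dev/null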
Query Profiling
Profile individual queries to identify performance bottlenecks:
# Basic profiling
lynxdb query 'level=error | stats count by source' --analyze
# Full profiling with per-operator timing
lynxdb query 'level=error | stats count by source' --analyze full
# Trace-level profiling
lynxdb query 'level=error | stats count by source' --analyze trace
The --analyze output shows:
- Segments scanned vs skipped (with skip reasons: bloom, time, index, stats)
- Rows read vs filtered
- Per-operator execution time
- Memory usage
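To keep a baseline for regression comparisons, a profile can be captured to a file:
# Save a full profile for later comparison
lynxdb query 'level=error | stats count by source' --analyze full > baseline-profile.txt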
Monitoring with External Tools
Prometheus Scraping
Poll the /api/v1/stats endpoint at regular intervals:
# prometheus.yml
scrape_configs:
  - job_name: 'lynxdb'
    scrape_interval: 30s
    metrics_path: '/api/v1/stats'
    static_configs:
      - targets: ['lynxdb:3100']
Kubernetes ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lynxdb
  namespace: lynxdb
spec:
  selector:
    matchLabels:
      app: lynxdb
  endpoints:
    - port: http
      path: /api/v1/stats
      interval: 30s
Watching Metrics
Use the lynxdb watch command for quick metric monitoring:
# Watch error rate every 30 seconds
lynxdb watch 'level=error | stats count' --interval 30s --diff
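The same pattern works for any query; for example, watching the per-source error breakdown reuses the query from the profiling examples:
# Watch error counts per source, highlighting changes between samples
lynxdb watch 'level=error | stats count by source' --interval 1m --diff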
Recommended Monitoring Checklist
For production deployments, monitor these at minimum:
- /health endpoint -- uptime monitoring
- Ingest rate -- detect pipeline failures
- Disk usage and growth rate -- capacity planning
- Query latency (p99) -- performance SLA
- Active queries vs max_concurrent -- concurrency saturation
- Cache hit rate -- query performance optimization
- Buffered events -- flush health
- Compaction backlog -- storage health
Cluster-specific items (when running in cluster mode):
- nodes_alive -- all nodes healthy
- nodes_dead -- should be 0
- shard_draining + shard_migrating -- rebalance in progress
- leader_changes_total -- Raft leader stability
- meta_loss_duration_ns -- meta quorum health
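Tying the checklist together, a minimal external check could look like the sketch below (every jq path and threshold is an assumption; map them to the actual /api/v1/stats schema before use):
#!/usr/bin/env bash
# Minimal production check: health endpoint plus a few stats thresholds.
# All jq paths and thresholds are illustrative assumptions.
set -euo pipefail
BASE=http://localhost:3100

curl -sf "$BASE/health" > /dev/null || { echo "CRITICAL: health check failed"; exit 2; }

stats=$(curl -sf "$BASE/api/v1/stats")
active=$(echo "$stats" | jq '.queries.active')
buffered=$(echo "$stats" | jq '.storage.buffered_events')

[ "$active" -lt 50 ] || echo "WARN: $active active queries -- approaching max_concurrent?"
[ "$buffered" -lt 100000 ] || echo "WARN: $buffered buffered events -- flush falling behind?"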
Next Steps
- Performance Tuning -- optimize based on metrics
- Troubleshooting -- diagnose common issues
- Retention Policies -- manage data lifecycle
- Query Settings -- tune concurrency and timeouts