Monitoring LynxDB

LynxDB exposes comprehensive health and performance metrics through its REST API and CLI commands. This guide covers what to monitor and how to access metrics.

Health Check

The /health endpoint is designed for load balancers and container orchestrators:

curl http://localhost:3100/health
# {"status": "ok"}

Returns 200 OK when the server is ready to accept requests. Use this for:

  • Load balancer health checks
  • Kubernetes liveness/readiness probes
  • Uptime monitoring

# CLI health check
lynxdb health

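For Kubernetes, the /health endpoint can back both probes directly. A minimal pod-spec sketch; the port number matches the examples in this guide, but the timing thresholds are illustrative assumptions, not LynxDB recommendations:

```yaml
# Container spec excerpt (illustrative; timings are assumptions)
livenessProbe:
  httpGet:
    path: /health
    port: 3100
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 3100
  periodSeconds: 5
  failureThreshold: 3
```
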
Server Status

The GET /api/v1/stats endpoint returns detailed server metrics:

curl -s http://localhost:3100/api/v1/stats | jq .

CLI shorthand:

lynxdb status

# Output:
# LynxDB v0.1.0 -- uptime 2d 5h 30m -- healthy
#
# Storage: 1.2 GB
# Events: 3,456,789 total (123,456 today)
# Segments: 42   Memtable: 8,200 events
# Sources: nginx (45%), api-gateway (30%), postgres (25%)
# Oldest: 2025-01-08T10:30:00Z
# Indexes: 3

For machine-readable output:

lynxdb status --format json
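
The JSON output can feed simple automated checks. Below is a sketch that flags the warning signs called out in the Key Metrics tables later in this guide; the field names (`buffered_events`, `active_queries`, `cache_hits`, `cache_misses`) are assumptions, so verify them against your actual `lynxdb status --format json` output first:

```python
# Sketch: flag warning signs in a LynxDB stats payload.
# Field names are assumptions -- check them against real output.

def check_stats(stats: dict, max_concurrent: int = 16) -> list[str]:
    """Return human-readable warnings for a stats dict."""
    warnings = []
    # Sustained batcher growth means flush/compaction is falling behind.
    if stats.get("buffered_events", 0) > 100_000:
        warnings.append("buffered events high: flush or compaction falling behind")
    # Running at the concurrency limit indicates a query bottleneck.
    if stats.get("active_queries", 0) >= max_concurrent:
        warnings.append("active queries at max_concurrent: query bottleneck")
    # Hit rate consistently below 50% suggests raising cache_max_bytes.
    hits = stats.get("cache_hits", 0)
    misses = stats.get("cache_misses", 0)
    if hits + misses > 0 and hits / (hits + misses) < 0.5:
        warnings.append("cache hit rate below 50%: consider raising cache_max_bytes")
    return warnings

# Example payload (fabricated for illustration):
sample = {"buffered_events": 250_000, "active_queries": 3,
          "cache_hits": 120, "cache_misses": 200}
print(check_stats(sample))
```

The thresholds here are placeholders; tune them to your own baseline before alerting on them.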

Live Dashboard

The lynxdb top command provides a full-screen live TUI dashboard:

lynxdb top
lynxdb top --interval 5s

Shows four panels:

  • Ingest: Events/sec, events today, total events
  • Queries: Active queries, cache hit rate, materialized view count, tail sessions
  • Storage: Total size, part count, batcher-buffered events, index count
  • Sources: Bar chart of events by source

Press q or Ctrl+C to exit.

Key Metrics

Ingest Metrics

| Metric | Description | What to Watch |
|---|---|---|
| Events ingested/sec | Current ingest rate | Sudden drops indicate pipeline issues |
| Events today | Events ingested since midnight | Compare to baseline |
| Total events | All events in storage | Growth rate |
| Buffered events | Events waiting in the ingest batcher | Sustained growth means flush or compaction is falling behind |

Storage Metrics

| Metric | Description | What to Watch |
|---|---|---|
| Total storage size | Disk usage for all segments | Capacity planning |
| Segment count | Number of .lsg segments | High L0 count = compaction backlog |
| Buffered events | In-memory batcher depth | Should return toward zero after flush cycles |
| Compaction backlog | Pending compaction work | Growing backlog = increase workers |

Query Metrics

| Metric | Description | What to Watch |
|---|---|---|
| Active queries | Currently executing queries | Near max_concurrent = bottleneck |
| Cache hit rate | Percentage of queries served from cache | Low rate = increase cache_max_bytes |
| Query latency (p50, p99) | Query execution time | Spikes indicate performance issues |
| Bloom filter skip rate | Segments skipped by bloom filters | Higher is better |

Tiering Metrics (S3)

| Metric | Description | What to Watch |
|---|---|---|
| Segments in hot tier | Segments on local SSD | Capacity planning |
| Segments in warm tier | Segments in S3 | Storage costs |
| Segment cache hit rate | Local cache for warm segments | Low rate = increase cache |
| Upload/download bytes | S3 transfer volume | Cost monitoring |

Cluster Metrics

In cluster mode, additional metrics are available for monitoring distributed operations.

Shard Metrics

| Metric | Description | What to Watch |
|---|---|---|
| shard_active | Shards in active state | Should match expected partition count |
| shard_draining | Shards being drained | Non-zero during rebalance |
| shard_migrating | Shards being migrated | Non-zero during rebalance or split |
| shard_splitting | Shards being split | Non-zero during hot partition split |
| shard_map_epoch | Current shard map version | Monotonically increasing |

Rebalance Metrics

| Metric | Description | What to Watch |
|---|---|---|
| rebalance_total | Total rebalances applied | Increasing during topology changes |
| rebalance_move_total | Total shard moves across all rebalances | Large numbers indicate frequent topology changes |
| rebalance_duration_ns | Duration of last rebalance | Growing duration may indicate large clusters |

Node Health Metrics

| Metric | Description | What to Watch |
|---|---|---|
| nodes_alive | Nodes sending heartbeats normally | Should match cluster size |
| nodes_suspect | Nodes with missed heartbeats | Non-zero may indicate network issues |
| nodes_dead | Nodes declared dead | Should be 0 in a healthy cluster |
| leader_changes_total | Raft leader transitions | Frequent changes indicate instability |

Split Metrics

| Metric | Description | What to Watch |
|---|---|---|
| split_total | Total partition splits proposed | Increasing under hot-spot load |
| split_duration_ns | Duration of last split | |

Meta Loss Metrics

| Metric | Description | What to Watch |
|---|---|---|
| meta_loss_duration_ns | Duration of current/last meta-loss episode | Should be 0 normally |
| meta_loss_duplicate_parts | Duplicate partitions detected during meta loss | Non-zero indicates potential conflicts |
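
The node-health and meta-loss checks above can be collapsed into a single go/no-go test. A sketch, assuming the metrics arrive as a flat dict keyed by the names in these tables:

```python
# Sketch: evaluate cluster health from the metrics described above.
# Assumes a flat dict keyed by metric name (an assumption about the payload shape).

def cluster_ok(metrics: dict, expected_nodes: int) -> tuple[bool, list[str]]:
    """Return (healthy, problems) for a cluster metrics dict."""
    problems = []
    alive = metrics.get("nodes_alive", 0)
    if alive != expected_nodes:
        problems.append(f"expected {expected_nodes} alive nodes, saw {alive}")
    if metrics.get("nodes_dead", 0) > 0:
        problems.append(f"{metrics['nodes_dead']} node(s) declared dead")
    if metrics.get("meta_loss_duration_ns", 0) > 0:
        problems.append("meta-loss episode in progress")
    return (not problems, problems)

ok, problems = cluster_ok(
    {"nodes_alive": 3, "nodes_dead": 0, "meta_loss_duration_ns": 0},
    expected_nodes=3)
print(ok, problems)
```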

Cache Statistics

lynxdb cache stats

# Output:
# Query Cache
# Hits: 12,456
# Misses: 3,789
# Hit Rate: 76.7%
# Entries: 1,234
# Size: 456 MB / 1.0 GB
# Evictions: 89

# Machine-readable
lynxdb cache stats --format json

If the hit rate is consistently below 50%, consider increasing storage.cache_max_bytes or storage.cache_ttl.
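
The hit rate in the output above is simply hits divided by total lookups. A one-liner for computing it from the JSON stats, shown with the numbers from the sample output:

```python
# Hit rate as reported above: hits / (hits + misses).
def hit_rate(hits: int, misses: int) -> float:
    total = hits + misses
    return hits / total if total else 0.0

# Using the numbers from the sample output:
rate = hit_rate(12_456, 3_789)
print(f"{rate:.1%}")  # -> 76.7%
```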

Diagnostics

The lynxdb doctor command runs a comprehensive health check:

lynxdb doctor

# Output:
# ok Binary v0.1.0 (linux/amd64, go1.25.4)
# ok Config /home/user/.config/lynxdb/config.yaml (valid)
# ok Data dir /var/lib/lynxdb (42 GB free)
# ok Server localhost:3100 (healthy, uptime 2d 5h)
# ok Events 3.4M total
# ok Storage 1.2 GB
# ok Retention 7d
# ok Completion zsh detected
#
# All checks passed.

# Machine-readable
lynxdb doctor --format json

Query Profiling

Profile individual queries to identify performance bottlenecks:

# Basic profiling
lynxdb query 'level=error | stats count by source' --analyze

# Full profiling with per-operator timing
lynxdb query 'level=error | stats count by source' --analyze full

# Trace-level profiling
lynxdb query 'level=error | stats count by source' --analyze trace

The --analyze output shows:

  • Segments scanned vs skipped (with skip reasons: bloom, time, index, stats)
  • Rows read vs filtered
  • Per-operator execution time
  • Memory usage

Monitoring with External Tools

Prometheus Scraping

Poll the /api/v1/stats endpoint at regular intervals:

# prometheus.yml
scrape_configs:
  - job_name: 'lynxdb'
    scrape_interval: 30s
    metrics_path: '/api/v1/stats'
    static_configs:
      - targets: ['lynxdb:3100']
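
Note that Prometheus parses the text exposition format, not JSON. If /api/v1/stats serves JSON, a small bridge (or the community json_exporter) is needed to translate the payload. A sketch of the translation step; the metric names and the `lynxdb_` prefix are assumptions:

```python
# Sketch: convert a JSON stats payload into Prometheus text exposition
# format. Field names and the metric prefix are assumptions.

def to_exposition(stats: dict, prefix: str = "lynxdb") -> str:
    """Emit one 'name value' line per numeric field, sorted for stable output."""
    lines = []
    for key, value in sorted(stats.items()):
        if isinstance(value, (int, float)):
            lines.append(f"{prefix}_{key} {value}")
    return "\n".join(lines) + "\n"

print(to_exposition({"events_total": 3_456_789, "storage_bytes": 1_200_000_000}))
```

Serving this text from an HTTP endpoint (e.g. via `http.server`) gives Prometheus something it can scrape directly.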

Kubernetes ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lynxdb
  namespace: lynxdb
spec:
  selector:
    matchLabels:
      app: lynxdb
  endpoints:
    - port: http
      path: /api/v1/stats
      interval: 30s

Watching Metrics

Use the lynxdb watch command for quick metric monitoring:

# Watch error rate every 30 seconds
lynxdb watch 'level=error | stats count' --interval 30s --diff

For production deployments, monitor these at minimum:

  • /health endpoint -- uptime monitoring
  • Ingest rate -- detect pipeline failures
  • Disk usage and growth rate -- capacity planning
  • Query latency (p99) -- performance SLA
  • Active queries vs max_concurrent -- concurrency saturation
  • Cache hit rate -- query performance optimization
  • Buffered events -- flush health
  • Compaction backlog -- storage health

Cluster-specific items (when running in cluster mode):

  • nodes_alive -- all nodes healthy
  • nodes_dead -- should be 0
  • shard_draining + shard_migrating -- rebalance in progress
  • leader_changes_total -- Raft leader stability
  • meta_loss_duration_ns -- meta quorum health

Next Steps