Monitoring LynxDB

LynxDB exposes comprehensive health and performance metrics through its REST API and CLI commands. This guide covers what to monitor, how to access metrics, and how to set up alerts.

Health Check

The /health endpoint is designed for load balancers and container orchestrators:

curl http://localhost:3100/health
# {"status": "ok"}

Returns 200 OK when the server is ready to accept requests. Use this for:

  • Load balancer health checks
  • Kubernetes liveness/readiness probes
  • Uptime monitoring

# CLI health check
lynxdb health
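For Kubernetes, the same endpoint can back both probes. A minimal sketch; the port matches the examples in this guide, while the timing values are assumptions to tune for your deployment:

```yaml
# Sketch: Kubernetes probes against LynxDB's /health endpoint.
livenessProbe:
  httpGet:
    path: /health
    port: 3100
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 3100
  periodSeconds: 5
  failureThreshold: 3
```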

Server Status

The GET /api/v1/stats endpoint returns detailed server metrics:

curl -s http://localhost:3100/api/v1/stats | jq .

CLI shorthand:

lynxdb status

# Output:
# LynxDB v0.1.0 -- uptime 2d 5h 30m -- healthy
#
# Storage: 1.2 GB
# Events: 3,456,789 total, 123,456 today
# Segments: 42
# Memtable: 8,200 events
# Sources: nginx (45%), api-gateway (30%), postgres (25%)
# Oldest: 2025-01-08T10:30:00Z
# Indexes: 3

For machine-readable output:

lynxdb status --format json
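The JSON output is convenient for scripted checks. A minimal sketch of one such check, assuming a hypothetical `events_per_sec` field; inspect your server's actual `/api/v1/stats` payload before relying on any field name:

```python
# Sketch: decide when the ingest pipeline looks stalled, based on the
# stats payload. The field name "events_per_sec" is an assumption.
import json
import urllib.request

def fetch_stats(url: str = "http://localhost:3100/api/v1/stats") -> dict:
    """Fetch the stats payload from a running server."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

def ingest_stalled(stats: dict, min_rate: float = 1.0) -> bool:
    """True when the reported ingest rate is below the expected floor."""
    return stats.get("events_per_sec", 0.0) < min_rate

# Usage against a live server (not run here):
#   if ingest_stalled(fetch_stats()):
#       notify("LynxDB ingest rate below threshold")
```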

Live Dashboard

The lynxdb top command provides a full-screen live TUI dashboard:

lynxdb top
lynxdb top --interval 5s

Shows four panels:

  • Ingest: Events/sec, events today, total events
  • Queries: Active queries, cache hit rate, materialized view count, tail sessions
  • Storage: Total size, segment count, memtable size, index count
  • Sources: Bar chart of events by source

Press q or Ctrl+C to exit.

Key Metrics

Ingest Metrics

| Metric | Description | What to Watch |
|---|---|---|
| Events ingested/sec | Current ingest rate | Sudden drops indicate pipeline issues |
| Events today | Events ingested since midnight | Compare to baseline |
| Total events | All events in storage | Growth rate |
| WAL size | Current WAL size | Growing WAL means flush is slow |

Storage Metrics

| Metric | Description | What to Watch |
|---|---|---|
| Total storage size | Disk usage for all segments | Capacity planning |
| Segment count | Number of .lsg segments | High L0 count = compaction backlog |
| Memtable size | In-memory buffer size | Should stay below flush_threshold |
| Compaction backlog | Pending compaction work | Growing backlog = increase workers |

Query Metrics

| Metric | Description | What to Watch |
|---|---|---|
| Active queries | Currently executing queries | Near max_concurrent = bottleneck |
| Cache hit rate | Percentage of queries served from cache | Low rate = increase cache_max_bytes |
| Query latency (p50, p99) | Query execution time | Spikes indicate performance issues |
| Bloom filter skip rate | Segments skipped by bloom filters | Higher is better |

Tiering Metrics (S3)

| Metric | Description | What to Watch |
|---|---|---|
| Segments in hot tier | Segments on local SSD | Capacity planning |
| Segments in warm tier | Segments in S3 | Storage costs |
| Segment cache hit rate | Local cache for warm segments | Low rate = increase cache |
| Upload/download bytes | S3 transfer volume | Cost monitoring |

Cache Statistics

lynxdb cache stats

# Output:
# Query Cache
# Hits: 12,456
# Misses: 3,789
# Hit Rate: 76.7%
# Entries: 1,234
# Size: 456 MB / 1.0 GB
# Evictions: 89

# Machine-readable
lynxdb cache stats --format json

If the hit rate is consistently below 50%, consider increasing storage.cache_max_bytes or storage.cache_ttl.
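The hit-rate check is easy to automate from the JSON output. A sketch assuming the payload exposes `hits` and `misses` counters; verify the actual field names against your server's output:

```python
# Sketch: compute the cache hit rate from `lynxdb cache stats --format json`
# output and decide whether the cache deserves more memory.
def hit_rate(stats: dict) -> float:
    """Fraction of lookups served from cache; 0.0 when there is no traffic."""
    total = stats.get("hits", 0) + stats.get("misses", 0)
    return stats.get("hits", 0) / total if total else 0.0

def should_grow_cache(stats: dict, floor: float = 0.5) -> bool:
    """Flag a hit rate below the 50% threshold suggested above."""
    return hit_rate(stats) < floor
```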

Diagnostics

The lynxdb doctor command runs a comprehensive health check:

lynxdb doctor

# Output:
# ok Binary v0.1.0 (linux/amd64, go1.25.4)
# ok Config /home/user/.config/lynxdb/config.yaml (valid)
# ok Data dir /var/lib/lynxdb (42 GB free)
# ok Server localhost:3100 (healthy, uptime 2d 5h)
# ok Events 3.4M total
# ok Storage 1.2 GB
# ok Retention 7d
# ok Completion zsh detected
#
# All checks passed.

# Machine-readable
lynxdb doctor --format json

Query Profiling

Profile individual queries to identify performance bottlenecks:

# Basic profiling
lynxdb query 'level=error | stats count by source' --analyze

# Full profiling with per-operator timing
lynxdb query 'level=error | stats count by source' --analyze full

# Trace-level profiling
lynxdb query 'level=error | stats count by source' --analyze trace

The --analyze output shows:

  • Segments scanned vs skipped (with skip reasons: bloom, time, index, stats)
  • Rows read vs filtered
  • Per-operator execution time
  • Memory usage

Monitoring with External Tools

Prometheus Scraping

Poll the /api/v1/stats endpoint at regular intervals:

# prometheus.yml
scrape_configs:
  - job_name: 'lynxdb'
    scrape_interval: 30s
    metrics_path: '/api/v1/stats'
    static_configs:
      - targets: ['lynxdb:3100']

Kubernetes ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lynxdb
  namespace: lynxdb
spec:
  selector:
    matchLabels:
      app: lynxdb
  endpoints:
    - port: http
      path: /api/v1/stats
      interval: 30s

Alerting on Metrics

Use the lynxdb watch command for quick metric monitoring:

# Watch error rate every 30 seconds
lynxdb watch 'level=error | stats count' --interval 30s --diff

Set up server-side alerts for infrastructure monitoring:

# Alert when disk usage is high
curl -X POST localhost:3100/api/v1/alerts -d '{
  "name": "High disk usage",
  "q": "| from _internal | where metric=\"storage_bytes\" | where value > 100000000000",
  "interval": "5m",
  "channels": [
    {"type": "slack", "config": {"webhook_url": "https://hooks.slack.com/..."}}
  ]
}'

For production deployments, monitor these at minimum:

  • /health endpoint -- uptime monitoring
  • Ingest rate -- detect pipeline failures
  • Disk usage and growth rate -- capacity planning
  • Query latency (p99) -- performance SLA
  • Active queries vs max_concurrent -- concurrency saturation
  • Cache hit rate -- query performance optimization
  • WAL size -- flush health
  • Compaction backlog -- storage health
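The first item on that list, uptime monitoring via /health, is small enough to run from cron. A sketch; the endpoint and its `{"status": "ok"}` body come from this guide, while the timeout and error handling are local choices:

```python
# Sketch: a minimal uptime check against LynxDB's /health endpoint.
import json
import urllib.request

def healthy(body: bytes) -> bool:
    """Parse a /health response body and check for status == "ok"."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        return False

def check(url: str = "http://localhost:3100/health") -> bool:
    """True when the server answers 200 with a healthy body."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200 and healthy(resp.read())
    except OSError:
        return False

# cron usage (not run here):
#   python3 check_lynxdb.py || mail -s "LynxDB down" ops@example.com
```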

Next Steps