Monitoring of TeskaLabs LogMan.io with Prometheus
This section provides recommendations for integrating Prometheus monitoring with TeskaLabs LogMan.io.
It covers infrastructure-level monitoring of cluster nodes and application-level monitoring of the TeskaLabs LogMan.io platform and its supporting services.
Scope and responsibilities
- Prometheus is not deployed as part of the TeskaLabs LogMan.io product.
- The customer is responsible for deploying and maintaining their own Prometheus infrastructure.
- TeskaLabs provides recommendations on which metrics to collect and monitor.
- TeskaLabs does not provide pre-built Prometheus alert rules.
- The customer or implementation partner is responsible for creating Prometheus scrape configuration, alerting rules, and dashboards.
Infrastructure Description
| Component | Details |
|---|---|
| Application | TeskaLabs LogMan.io |
| Deployment | Multi-node cluster |
| Hardware | Supermicro or Dell servers |
| Operating System | Ubuntu Server 22.04 LTS |
| Containerization | Docker with Docker Compose |
| Telemetry DB | InfluxDB-compatible telemetry exposed to Prometheus |
| Key services | Elasticsearch, Apache Kafka, ZooKeeper, MongoDB, NGINX, InfluxDB |
Prometheus Integration
Prometheus should scrape metrics from the LogMan.io telemetry layer or from a bridge/exporter exposing existing InfluxDB-style metrics.
Note
Metric names in this document intentionally use the existing LogMan.io / InfluxDB-style naming. Do not rename them to native Prometheus exporter metric names unless the telemetry pipeline is changed.
Recommended Metrics and Thresholds
System-Level Metrics
| Metric | Query | Threshold / Condition | Severity |
|---|---|---|---|
| System load 1m | system.load1 | Warning if sustained above expected CPU core count | Medium |
| System load 5m | system.load5 | Warning if sustained above expected CPU core count | Medium |
| System load 15m | system.load15 | Warning if sustained above expected CPU core count | Medium |
| CPU usage | cpu.usage_user + cpu.usage_system | Warning ≥ 85% sustained over 15 minutes | High |
| CPU I/O wait | cpu.usage_iowait | Warning ≥ 30%, Critical ≥ 50% | High |
| Memory usage | mem.used_percent | Warning ≥ 80%, Critical ≥ 90% | High |
| Swap usage | swap.used_percent | Warning ≥ 50%, Critical ≥ 70% | Medium |
| SSD usage | disk.used_percent{path="/data/ssd"} | Warning ≥ 65%, Critical ≥ 80% | High |
| HDD usage | disk.used_percent{path="/data/hdd"} | Warning ≥ 65%, Critical ≥ 80% | High |
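
The warning and critical limits in the table above can be expressed as a small lookup, which may help when translating them into alert rules. The following is a minimal Python sketch, not part of LogMan.io: the names `THRESHOLDS`, `classify`, and `sustained_at_or_above` are hypothetical, and it assumes the samples have already been fetched from the telemetry layer as plain percentages.

```python
from typing import Sequence

# Warning and critical limits in percent, copied from the table above.
THRESHOLDS = {
    "cpu.usage_iowait": (30.0, 50.0),
    "mem.used_percent": (80.0, 90.0),
    "swap.used_percent": (50.0, 70.0),
    "disk.used_percent": (65.0, 80.0),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for one sample of a metric."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def sustained_at_or_above(samples: Sequence[float], limit: float) -> bool:
    """True only if the window is non-empty and every sample reaches the limit."""
    return len(samples) > 0 and all(s >= limit for s in samples)
```

For the CPU usage row, `sustained_at_or_above` could be called with 15 minutes of `cpu.usage_user + cpu.usage_system` samples and a limit of 85. Requiring every sample in the window to reach the limit is one possible reading of "sustained"; an average over the window is another.
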
LogMan.io Application Metrics
| Metric | Query | Threshold / Condition | Severity |
|---|---|---|---|
| Log ingest EPS mean | mean(non_negative_difference(last(commlink.event.in)) / 60) | Sudden drop or sustained unexpected decrease | High |
| Log ingest EPS max | max(non_negative_difference(last(commlink.event.in)) / 60) | Used for capacity planning | Medium |
| Log ingest EPS 95th percentile | (lmio-charts) | Used for capacity planning and anomaly detection | Medium |
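
The EPS queries above derive a per-second rate from the cumulative `commlink.event.in` counter. The Python sketch below reproduces that arithmetic on a list of readings. It assumes one counter reading per minute (which is what the division by 60 suggests), and the function names are hypothetical, chosen to mirror the InfluxDB-style functions in the query.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def non_negative_difference(values: Sequence[float]) -> list:
    """Differences between consecutive readings, dropping negative ones
    (for example, after a counter reset)."""
    diffs = [later - earlier for earlier, later in zip(values, values[1:])]
    return [d for d in diffs if d >= 0]


def eps_stats(counter_per_minute: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return (mean EPS, max EPS), or None if there are too few readings.

    Assumes exactly one reading of commlink.event.in per minute.
    """
    rates = [d / 60 for d in non_negative_difference(counter_per_minute)]
    if not rates:
        return None
    return mean(rates), max(rates)
```

With readings `[0, 600, 1800, 100, 700]`, the drop from 1800 to 100 is discarded as a reset, leaving per-second rates of 10, 20, and 10.
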
Kafka Consumer Lag
| Metric | Query | Filter | Threshold / Condition | Severity |
|---|---|---|---|---|
| Parsec lag | kafka.consumer_group.lag | group =~ /^lmio-parsec.*/ | Lag must not grow continuously | Critical |
| Depositor lag | kafka.consumer_group.lag | group = lmio_depositor | Lag must not grow continuously | Critical |
| Correlator lag | kafka.consumer_group.lag | group =~ /^lmio_correlator.*/ | Lag must not grow continuously | Critical |
| Baseliner lag | kafka.consumer_group.lag | group = lmio_baseliner | Lag must not grow continuously | High |
| Watcher lag | kafka.consumer_group.lag | group = lmio_watcher | Lag must not grow continuously | High |
| ASAB IRIS lag | kafka.consumer_group.lag | group = asab-iris | Lag must not grow continuously | High |
| Alert Management lag | kafka.consumer_group.lag | group = lmio-alerts | Lag must not grow continuously | High |
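
The condition "lag must not grow continuously" is about the trend rather than a fixed limit. One way to make it concrete is to flag a consumer group whose lag rises at every step of an observation window. The Python sketch below illustrates that interpretation; the function names and the default `min_samples` value are assumptions, and the regular expression is taken from the Parsec row of the table.

```python
import re
from typing import Dict, List, Sequence

# Filter from the Parsec row above; the other regex rows follow the same shape.
PARSEC_GROUPS = re.compile(r"^lmio-parsec.*")


def lag_grows_continuously(lags: Sequence[float], min_samples: int = 5) -> bool:
    """True if the window is long enough and lag rose at every step."""
    if len(lags) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(lags, lags[1:]))


def growing_groups(pattern: "re.Pattern[str]",
                   lag_by_group: Dict[str, Sequence[float]]) -> List[str]:
    """Names of consumer groups that match the filter and show growing lag."""
    return sorted(
        group
        for group, lags in lag_by_group.items()
        if pattern.match(group) and lag_grows_continuously(lags)
    )
```

A strictly increasing window is a deliberately strict test. In practice, a comparison of the first and last sample, or a slope over a longer window, may tolerate short dips better.
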
Elasticsearch Monitoring
| Metric | Query | Threshold / Condition | Severity |
|---|---|---|---|
| Cluster health | elasticsearch_cluster_health.status | Must be green | Critical |
| Relocating shards | elasticsearch_cluster_health.relocating_shards | Should return to 0 after maintenance | Medium |
| Unassigned shards | elasticsearch_cluster_health.unassigned_shards | Must be 0 | High |
| Active shards | elasticsearch_cluster_health.active_shards | Monitor for unexpected changes | Medium |
| Active nodes | last(elasticsearch_clusterstats_nodes.count_total) | Must match expected node count | Critical |
| Inactive nodes | max(elasticsearch_clusterstats_nodes.count_total) - last(elasticsearch_clusterstats_nodes.count_total) | Must be 0 | Critical |
| Shards per index | _cat/shards or _cat/indices grouped by index | Watch for excessive shard count | Medium |
| Index size | elasticsearch_indices_stats_total.store_size_in_bytes | Investigate oversized indices | Medium |
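
The "Inactive nodes" row computes the difference between the highest node count seen in the query window and the most recent one. The Python sketch below shows that calculation together with the other hard conditions from the table. It is illustrative only: the function names are hypothetical, and it assumes that `elasticsearch_cluster_health.status` is available as a string such as `green`.

```python
from typing import List, Sequence, Tuple


def inactive_nodes(node_counts: Sequence[int]) -> int:
    """max(count_total) - last(count_total) over the query window."""
    return max(node_counts) - node_counts[-1]


def check_cluster(status: str, unassigned_shards: int,
                  node_counts: Sequence[int],
                  expected_nodes: int) -> List[Tuple[str, str]]:
    """Return (severity, message) pairs for each violated condition."""
    findings = []
    if status != "green":
        findings.append(("Critical", f"cluster health is {status}"))
    if unassigned_shards != 0:
        findings.append(("High", f"{unassigned_shards} unassigned shards"))
    if node_counts[-1] != expected_nodes:
        findings.append(
            ("Critical", f"{node_counts[-1]} of {expected_nodes} nodes active"))
    missing = inactive_nodes(node_counts)
    if missing != 0:
        findings.append(("Critical", f"{missing} inactive nodes"))
    return findings
```

Note that `inactive_nodes` only detects nodes that were present at some point within the window, so comparing against the expected node count remains the more reliable check after a long outage.
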