Monitoring of TeskaLabs LogMan.io with Prometheus

This section provides recommendations for integrating Prometheus monitoring with TeskaLabs LogMan.io.

It covers infrastructure-level monitoring of cluster nodes and application-level monitoring of the TeskaLabs LogMan.io platform and its supporting services.

Scope and responsibilities

Prometheus is not deployed as part of the TeskaLabs LogMan.io product.
The customer is responsible for deploying and maintaining their own Prometheus infrastructure.
TeskaLabs provides recommendations on which metrics to collect and monitor.
TeskaLabs does not provide pre-built Prometheus alert rules.
The customer or implementation partner is responsible for creating the Prometheus scrape configuration, alerting rules, and dashboards.

Infrastructure Description

Component	Details
Application	TeskaLabs LogMan.io
Deployment	Multi-node cluster
Hardware	Supermicro servers or Dell servers
Operating System	Ubuntu Server 22.04 LTS
Containerization	Docker with Docker Compose
Telemetry DB	InfluxDB-compatible telemetry exposed to Prometheus
Key services	Elasticsearch, Apache Kafka, ZooKeeper, MongoDB, NGINX, InfluxDB

Prometheus Integration

Prometheus should scrape metrics either from the LogMan.io telemetry layer or from a bridge or exporter that exposes the existing InfluxDB-style metrics.

Note

Metric names in this document intentionally use the existing LogMan.io / InfluxDB-style naming. Do not rename them to native Prometheus exporter metric names unless the telemetry pipeline is changed.

Recommended Metrics and Thresholds

System-Level Metrics

Metric	Query	Threshold / Condition	Severity
System load 1m	`system.load1`	Warning if sustained above expected CPU core count	Medium
System load 5m	`system.load5`	Warning if sustained above expected CPU core count	Medium
System load 15m	`system.load15`	Warning if sustained above expected CPU core count	Medium
CPU usage	`cpu.usage_user + cpu.usage_system`	Warning ≥ 85% sustained over 15 minutes	High
CPU I/O wait	`cpu.usage_iowait`	Warning ≥ 30%, Critical ≥ 50%	High
Memory usage	`mem.used_percent`	Warning ≥ 80%, Critical ≥ 90%	High
Swap usage	`swap.used_percent`	Warning ≥ 50%, Critical ≥ 70%	Medium
SSD usage	`disk.used_percent{path="/data/ssd"}`	Warning ≥ 65%, Critical ≥ 80%	High
HDD usage	`disk.used_percent{path="/data/hdd"}`	Warning ≥ 65%, Critical ≥ 80%	High
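
The warning and critical thresholds in this table can be evaluated with a small helper. The following Python sketch is illustrative only: the function names `classify` and `sustained_at_or_above` are hypothetical, and the one-sample-per-minute interval is an assumption, not something defined by LogMan.io or Prometheus.

```python
from typing import Optional, Sequence


def classify(value: float, warning: float, critical: Optional[float] = None) -> str:
    """Return "critical", "warning" or "ok" for a metric where higher is worse."""
    if critical is not None and value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def sustained_at_or_above(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True if the last `window` samples all reach the threshold.

    `window` counts samples, so 15 one-minute samples approximate
    "sustained over 15 minutes" (assumed sampling interval).
    """
    if window <= 0 or len(samples) < window:
        return False
    return all(v >= threshold for v in samples[-window:])


# Thresholds taken from the table; the sample values are made up.
memory_state = classify(84.0, warning=80, critical=90)
iowait_state = classify(55.0, warning=30, critical=50)
disk_state = classify(40.0, warning=65, critical=80)

# CPU usage = cpu.usage_user + cpu.usage_system, one sample per minute (assumed).
cpu_samples = [86.0] * 15
cpu_alert = sustained_at_or_above(cpu_samples, threshold=85, window=15)
```

In a real deployment the same logic would normally live in Prometheus alerting rules. The sketch only makes the threshold semantics explicit.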

LogMan.io Application Metrics

Metric	Query	Threshold / Condition	Severity
Log ingest EPS mean	`mean(non_negative_difference(last(commlink.event.in)) / 60)`	Sudden drop or sustained unexpected decrease	High
Log ingest EPS max	`max(non_negative_difference(last(commlink.event.in)) / 60)`	Used for capacity planning	Medium
Log ingest EPS 95th percentile	`(lmio-charts)`	Used for capacity planning and anomaly detection	Medium
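
The EPS queries above divide the non-negative difference of the `commlink.event.in` counter by 60. The Python sketch below reproduces that arithmetic on a hypothetical list of counter readings. It assumes one reading per 60-second interval and uses the nearest-rank method for the 95th percentile, which may differ from the method used by `lmio-charts`.

```python
import math
from typing import List, Sequence


def non_negative_differences(counter: Sequence[float]) -> List[float]:
    """Differences between consecutive readings, skipping negative ones
    (for example after a counter reset)."""
    diffs = []
    for previous, current in zip(counter, counter[1:]):
        delta = current - previous
        if delta >= 0:
            diffs.append(delta)
    return diffs


def eps_series(counter: Sequence[float], interval_s: float = 60.0) -> List[float]:
    """Events per second for each interval between readings."""
    return [d / interval_s for d in non_negative_differences(counter)]


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (an assumed method, chosen for simplicity)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical readings of commlink.event.in, one per minute.
# The drop from 12600 to 600 simulates a counter reset and is skipped.
readings = [0, 6000, 12600, 600, 7200, 19200]
eps = eps_series(readings)
mean_eps = sum(eps) / len(eps)
max_eps = max(eps)
p95_eps = percentile_nearest_rank(eps, 95)
```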

Kafka Consumer Lag

Metric	Query	Filter	Threshold / Condition	Severity
Parsec lag	`kafka.consumer_group.lag`	`group =~ /^lmio-parsec.*/`	Lag must not grow continuously	Critical
Depositor lag	`kafka.consumer_group.lag`	`group = lmio_depositor`	Lag must not grow continuously	Critical
Correlator lag	`kafka.consumer_group.lag`	`group =~ /^lmio_correlator.*/`	Lag must not grow continuously	Critical
Baseliner lag	`kafka.consumer_group.lag`	`group = lmio_baseliner`	Lag must not grow continuously	High
Watcher lag	`kafka.consumer_group.lag`	`group = lmio_watcher`	Lag must not grow continuously	High
ASAB IRIS lag	`kafka.consumer_group.lag`	`group = asab-iris`	Lag must not grow continuously	High
Alert Management lag	`kafka.consumer_group.lag`	`group = lmio-alerts`	Lag must not grow continuously	High
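
One possible reading of "lag must not grow continuously" is that every recent sample is higher than the one before it. The Python sketch below implements that interpretation. The function name, the five-sample window, the group name `lmio-parsec-1`, and the lag values are all hypothetical.

```python
import re
from typing import Sequence


def lag_grows_continuously(lag: Sequence[float], min_samples: int = 5) -> bool:
    """True if each of the last `min_samples` readings exceeds the previous one."""
    if min_samples < 2 or len(lag) < min_samples:
        return False
    recent = lag[-min_samples:]
    return all(b > a for a, b in zip(recent, recent[1:]))


# Group filter from the Parsec row, written as a Python regular expression.
PARSEC_GROUP = re.compile(r"^lmio-parsec.*")

# Made-up kafka.consumer_group.lag samples per consumer group.
lag_by_group = {
    "lmio-parsec-1": [10, 40, 90, 160, 250],
    "lmio_depositor": [300, 120, 80, 95, 60],
}

growing = {group: lag_grows_continuously(samples)
           for group, samples in lag_by_group.items()}
parsec_groups = [group for group in lag_by_group if PARSEC_GROUP.match(group)]
```

A strict comparison like this is sensitive to a single flat or falling sample. A trend over a longer window may suit bursty topics better.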

Elasticsearch Monitoring

Metric	Query	Threshold / Condition	Severity
Cluster health	`elasticsearch_cluster_health.status`	Must be green	Critical
Relocating shards	`elasticsearch_cluster_health.relocating_shards`	Should return to 0 after maintenance	Medium
Unassigned shards	`elasticsearch_cluster_health.unassigned_shards`	Must be 0	High
Active shards	`elasticsearch_cluster_health.active_shards`	Monitor for unexpected changes	Medium
Active nodes	`last(elasticsearch_clusterstats_nodes.count_total)`	Must match expected node count	Critical
Inactive nodes	`max(elasticsearch_clusterstats_nodes.count_total) - last(elasticsearch_clusterstats_nodes.count_total)`	Must be 0	Critical
Shards per index	`_cat/shards` or `_cat/indices` grouped by index	Watch for excessive shard count	Medium
Index size	`elasticsearch_indices_stats_total.store_size_in_bytes`	Investigate oversized indices	Medium
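
The critical conditions in this table (green status, no unassigned shards, expected node count, no inactive nodes) can be combined into a single check. The Python sketch below is a hypothetical illustration. It assumes the status is available as the string `"green"`, `"yellow"` or `"red"`, and that `node_counts` holds successive values of `elasticsearch_clusterstats_nodes.count_total` over the observed window.

```python
from typing import List, Sequence


def inactive_nodes(node_counts: Sequence[int]) -> int:
    """max(count_total) - last(count_total) over the observed window."""
    return max(node_counts) - node_counts[-1]


def cluster_problems(status: str, unassigned_shards: int,
                     node_counts: Sequence[int], expected_nodes: int) -> List[str]:
    """Return a list of violated conditions; an empty list means healthy."""
    problems = []
    if status != "green":
        problems.append(f"cluster status is {status}")
    if unassigned_shards != 0:
        problems.append(f"{unassigned_shards} unassigned shards")
    if node_counts[-1] != expected_nodes:
        problems.append(f"{node_counts[-1]} of {expected_nodes} nodes active")
    missing = inactive_nodes(node_counts)
    if missing != 0:
        problems.append(f"{missing} nodes inactive")
    return problems


# Made-up readings for a three-node cluster.
healthy = cluster_problems("green", 0, [3, 3, 3], expected_nodes=3)
degraded = cluster_problems("yellow", 4, [3, 3, 2], expected_nodes=3)
```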