Monitoring of TeskaLabs LogMan.io with Prometheus

This section provides recommendations for integrating Prometheus monitoring with TeskaLabs LogMan.io.

It covers infrastructure-level monitoring of cluster nodes and application-level monitoring of the TeskaLabs LogMan.io platform and its supporting services.

Scope and responsibilities

  • Prometheus is not deployed as part of the TeskaLabs LogMan.io product.
  • The customer is responsible for deploying and maintaining their own Prometheus infrastructure.
  • TeskaLabs provides recommendations on which metrics to collect and monitor.
  • TeskaLabs does not provide pre-built Prometheus alert rules.
  • The customer or implementation partner is responsible for creating Prometheus scrape configuration, alerting rules, and dashboards.

Infrastructure Description

| Component | Details |
|---|---|
| Application | TeskaLabs LogMan.io |
| Deployment | Multi-node cluster |
| Hardware | Supermicro servers or Dell servers |
| Operating System | Ubuntu Server 22.04 LTS |
| Containerization | Docker with Docker Compose |
| Telemetry DB | InfluxDB-compatible telemetry exposed to Prometheus |
| Key services | Elasticsearch, Apache Kafka, ZooKeeper, MongoDB, NGINX, InfluxDB |

Prometheus Integration

Prometheus should scrape metrics either directly from the LogMan.io telemetry layer or from a bridge or exporter that exposes the existing InfluxDB-style metrics.

Note

Metric names in this document intentionally use the existing LogMan.io / InfluxDB-style naming. Do not rename them to native Prometheus exporter metric names unless the telemetry pipeline is changed.

System-Level Metrics

| Metric | Query | Threshold / Condition | Severity |
|---|---|---|---|
| System load 1m | system.load1 | Warning if sustained above expected CPU core count | Medium |
| System load 5m | system.load5 | Warning if sustained above expected CPU core count | Medium |
| System load 15m | system.load15 | Warning if sustained above expected CPU core count | Medium |
| CPU usage | cpu.usage_user + cpu.usage_system | Warning ≥ 85% sustained over 15 minutes | High |
| CPU I/O wait | cpu.usage_iowait | Warning ≥ 30%, Critical ≥ 50% | High |
| Memory usage | mem.used_percent | Warning ≥ 80%, Critical ≥ 90% | High |
| Swap usage | swap.used_percent | Warning ≥ 50%, Critical ≥ 70% | Medium |
| SSD usage | disk.used_percent{path="/data/ssd"} | Warning ≥ 65%, Critical ≥ 80% | High |
| HDD usage | disk.used_percent{path="/data/hdd"} | Warning ≥ 65%, Critical ≥ 80% | High |
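
The warning and critical thresholds in the table can be expressed as a small lookup, which may help when translating them into alert rules. The following Python sketch is illustrative only: the names `THRESHOLDS`, `classify`, and `load_sustained_above_cores` are hypothetical, and treating "sustained" as "every sample in the evaluated window exceeds the limit" is an assumption, not a LogMan.io definition.

```python
# Illustrative sketch only; names and the "sustained" interpretation are assumptions.

# (warning, critical) thresholds in percent, transcribed from the table above.
THRESHOLDS = {
    "cpu.usage_iowait": (30.0, 50.0),
    "mem.used_percent": (80.0, 90.0),
    "swap.used_percent": (50.0, 70.0),
    "disk.used_percent": (65.0, 80.0),
}


def classify(metric, value):
    """Return 'critical', 'warning', or 'ok' for a single sample."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def load_sustained_above_cores(load_samples, cpu_cores):
    """True if every load sample in the window exceeds the CPU core count.

    Assumption: 'sustained' means all samples in the window are above the limit.
    """
    return bool(load_samples) and all(s > cpu_cores for s in load_samples)
```

For example, `classify("mem.used_percent", 85.0)` falls between the two memory thresholds and is reported as a warning.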

LogMan.io Application Metrics

| Metric | Query | Threshold / Condition | Severity |
|---|---|---|---|
| Log ingest EPS mean | mean(non_negative_difference(last(commlink.event.in)) / 60) | Sudden drop or sustained unexpected decrease | High |
| Log ingest EPS max | max(non_negative_difference(last(commlink.event.in)) / 60) | Used for capacity planning | Medium |
| Log ingest EPS 95th percentile | (lmio-charts) | Used for capacity planning and anomaly detection | Medium |
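
To make the EPS queries above easier to follow, here is a rough Python sketch of the same arithmetic. It assumes that `commlink.event.in` is a cumulative event counter and that one value (the `last` sample) is taken per 60-second window, which is what the query shape suggests but is not confirmed here. The function names are hypothetical. As with InfluxDB's `non_negative_difference`, negative differences (for example after a counter reset) are dropped rather than counted.

```python
# Illustrative sketch only; assumes one counter sample per 60-second window.
from statistics import mean


def non_negative_differences(samples):
    """Differences between consecutive samples, skipping negative ones."""
    diffs = []
    for prev, curr in zip(samples, samples[1:]):
        delta = curr - prev
        if delta >= 0:
            diffs.append(delta)
    return diffs


def eps_stats(samples, window_seconds=60):
    """Mean and max events per second derived from cumulative counter samples."""
    rates = [d / window_seconds for d in non_negative_differences(samples)]
    if not rates:
        return None  # not enough data to compute a rate
    return {"mean": mean(rates), "max": max(rates)}


# Counter resets between the third and fourth sample; that step is ignored.
stats = eps_stats([0, 6000, 18000, 100, 6100])
```

In this example the per-window differences are 6000, 12000, and 6000 events, so the maximum is 200 EPS and the mean is roughly 133 EPS.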

Kafka Consumer Lag

| Metric | Query | Filter | Threshold / Condition | Severity |
|---|---|---|---|---|
| Parsec lag | kafka.consumer_group.lag | group =~ /^lmio-parsec.*/ | Lag must not grow continuously | Critical |
| Depositor lag | kafka.consumer_group.lag | group = lmio_depositor | Lag must not grow continuously | Critical |
| Correlator lag | kafka.consumer_group.lag | group =~ /^lmio_correlator.*/ | Lag must not grow continuously | Critical |
| Baseliner lag | kafka.consumer_group.lag | group = lmio_baseliner | Lag must not grow continuously | High |
| Watcher lag | kafka.consumer_group.lag | group = lmio_watcher | Lag must not grow continuously | High |
| ASAB IRIS lag | kafka.consumer_group.lag | group = asab-iris | Lag must not grow continuously | High |
| Alert Management lag | kafka.consumer_group.lag | group = lmio-alerts | Lag must not grow continuously | High |
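
The condition "lag must not grow continuously" is not defined more precisely in this document. One possible reading, sketched below in Python, is that the lag increases strictly at every step across the most recent samples. The function names and the `min_points` parameter are hypothetical, and the window length should be tuned to the scrape interval. The group filters are transcribed from the table, with exact-match filters anchored as full-string patterns.

```python
# Illustrative sketch only; "grows continuously" is interpreted as
# "strictly increasing over the last min_points samples" (an assumption).
import re

# (pattern, severity) pairs transcribed from the table above.
GROUP_SEVERITY = [
    (re.compile(r"^lmio-parsec.*"), "Critical"),
    (re.compile(r"^lmio_depositor$"), "Critical"),
    (re.compile(r"^lmio_correlator.*"), "Critical"),
    (re.compile(r"^lmio_baseliner$"), "High"),
    (re.compile(r"^lmio_watcher$"), "High"),
    (re.compile(r"^asab-iris$"), "High"),
    (re.compile(r"^lmio-alerts$"), "High"),
]


def severity_for_group(group):
    """Return the severity for a consumer group, or None if it is not listed."""
    for pattern, severity in GROUP_SEVERITY:
        if pattern.match(group):
            return severity
    return None


def lag_grows_continuously(lag_samples, min_points=5):
    """True if the last min_points lag samples are strictly increasing."""
    if len(lag_samples) < min_points:
        return False  # too little data to call it a trend
    recent = lag_samples[-min_points:]
    return all(b > a for a, b in zip(recent, recent[1:]))
```

A lag that spikes and then drains, such as `[10, 500, 300, 120, 40]`, does not trigger this check, while `[10, 20, 30, 40, 50]` does.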

Elasticsearch Monitoring

| Metric | Query | Threshold / Condition | Severity |
|---|---|---|---|
| Cluster health | elasticsearch_cluster_health.status | Must be green | Critical |
| Relocating shards | elasticsearch_cluster_health.relocating_shards | Should return to 0 after maintenance | Medium |
| Unassigned shards | elasticsearch_cluster_health.unassigned_shards | Must be 0 | High |
| Active shards | elasticsearch_cluster_health.active_shards | Monitor for unexpected changes | Medium |
| Active nodes | last(elasticsearch_clusterstats_nodes.count_total) | Must match expected node count | Critical |
| Inactive nodes | max(elasticsearch_clusterstats_nodes.count_total) - last(elasticsearch_clusterstats_nodes.count_total) | Must be 0 | Critical |
| Shards per index | _cat/shards or _cat/indices grouped by index | Watch for excessive shard count | Medium |
| Index size | elasticsearch_indices_stats_total.store_size_in_bytes | Investigate oversized indices | Medium |
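
The "Inactive nodes" query subtracts the latest node count from the highest count seen in the query window, so it only reports nodes that were present at some point in that window. The Python sketch below mirrors that arithmetic and combines it with the "must be" conditions from the table. The function names are hypothetical, and the sketch assumes the health status arrives as the string `green`, `yellow`, or `red`.

```python
# Illustrative sketch only; function names and the string status are assumptions.


def inactive_nodes(node_counts):
    """max(count_total) - last(count_total) over the samples in the window."""
    if not node_counts:
        raise ValueError("at least one node count sample is required")
    return max(node_counts) - node_counts[-1]


def cluster_problems(status, unassigned_shards, node_counts, expected_nodes):
    """Return a list of violated 'must be' conditions from the table above."""
    problems = []
    if status != "green":
        problems.append("cluster health is " + status)
    if unassigned_shards != 0:
        problems.append(str(unassigned_shards) + " unassigned shards")
    if node_counts[-1] != expected_nodes:
        problems.append("active nodes do not match expected node count")
    if inactive_nodes(node_counts) != 0:
        problems.append(str(inactive_nodes(node_counts)) + " inactive nodes")
    return problems
```

Because the window maximum stands in for the intended cluster size, a node that has been down for longer than the window no longer shows up as inactive. Comparing the latest count with an explicitly configured expected node count, as in `cluster_problems`, covers that case.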