Monitoring detection performance

Correlation performance is checked in two places, for two different questions:

| Where | What you check | Question answered |
| --- | --- | --- |
| Grafana | Kafka consumer lag per Correlator instance (group_id) | Which instance is falling behind? |
| LogMan.io UI → Detections (/detection) | Time metrics per rule (timetotal, P/E/A/T, …) | Which rules on that instance are expensive? |

Do not use Grafana to tune individual rules by time. The correlator.time metric is intended for operators and capacity planning; analysts optimize rules using Total time on the Detections screen.

Time metrics in the UI are produced by LogMan.io Correlator, aggregated by LogMan.io TRex, and refreshed about every 60 seconds.

See Correlator, Correlator metrics, and Prometheus monitoring.


Time metrics on Detections (which rules?)

Open Detections in the web app. The tree lists correlation rules from the Library (by folder and YAML file). Each rule row can show:

| UI label | Metric key | Meaning |
| --- | --- | --- |
| Total time (stopwatch) | timetotal | Cumulative CPU time spent in the rule (all phases) |
| P | timepredicate | Time in the predicate filter |
| E | timeevaluate | Time in evaluate (window rules) |
| A | timeanalyze | Time in analyze (window rules) |
| T | timetrigger | Time in trigger |
| Bell icon | triggerin | Trigger executions in the last hour |
| Arrow icon | predicatehit | Predicate hits in the last hour |

The phase with the highest time is highlighted (warning color). Click a rule row with metrics to open Discover on complex events for that rule (last hour).

Folder rows can show aggregated metrics for all rules under that path.

Rule types

The P / E / A / T breakdown is most meaningful for window correlation rules. Match, sigma, list, and other correlator types still report Total time and the predicate/trigger counters, but may not use every phase.


How to read Total time

Total time is not the latency of a single event. It is the sum of the processing time the Correlator spent on that rule within the metrics window, reported in seconds to three decimal places.

| Total time | Guidance |
| --- | --- |
| < 1 s | Usually fine |
| 1–5 s | Worth watching; check predicate and event volume |
| ≥ 5 s | High: treat as a performance problem for that rule |

A high Total time with a high predicate hit count often means the rule runs on too many events. A high Total time with low hits points to an expensive predicate or analyze step on the events that do match.
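The guidance bands and the hit-count heuristic can be expressed as a small triage helper. This is only a sketch: the function names and the `busy_hits` cutoff are assumptions made for illustration, since the page does not define what counts as a "high" predicate hit count.

```python
def classify_total_time(total_s: float) -> str:
    """Map a rule's Total time (seconds) to the guidance bands."""
    if total_s < 1.0:
        return "usually fine"
    if total_s < 5.0:
        return "worth watching"
    return "high"


def likely_cause(total_s: float, predicate_hits: int, busy_hits: int = 10_000) -> str:
    """Rough reading of a high Total time.

    busy_hits is an assumed cutoff for "many predicate hits in the last hour";
    tune it to the event volume of your deployment.
    """
    if classify_total_time(total_s) != "high":
        return "no action needed yet"
    if predicate_hits >= busy_hits:
        return "rule runs on too many events"
    return "expensive predicate or analyze step"
```

For example, `likely_cause(7.2, 50_000)` points at event volume, while `likely_cause(7.2, 12)` points at the cost of the predicate or analyze step itself.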

Compare phases:

  • P dominates → simplify predicate, use structured fields, avoid scanning message.
  • E or A dominates → narrow the window (predicate, logsource), reduce cardinality in evaluate.dimension, simplify analyze.test.
  • T dominates → review trigger actions (templates, lookups, output volume).
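The phase comparison above can be sketched as a lookup from the dominant phase to the matching advice. The metric keys are the ones shown on the Detections screen; the shape of the `metrics` dictionary is an assumption, as the page does not describe how these values are exported.

```python
# Advice per phase, condensed from the list above.
ADVICE = {
    "timepredicate": "simplify predicate, use structured fields, avoid scanning message",
    "timeevaluate": "narrow the window, reduce cardinality in evaluate.dimension",
    "timeanalyze": "narrow the window, simplify analyze.test",
    "timetrigger": "review trigger actions (templates, lookups, output volume)",
}


def dominant_phase(metrics: dict) -> tuple:
    """Return (metric key, advice) for the phase with the highest time.

    Phases missing from `metrics` count as 0.0, which suits rule types
    that do not use every phase.
    """
    phases = {key: float(metrics.get(key, 0.0)) for key in ADVICE}
    key = max(phases, key=phases.get)
    return key, ADVICE[key]
```

Called with `{"timepredicate": 0.2, "timeevaluate": 3.1, "timeanalyze": 0.9, "timetrigger": 0.1}`, it returns the `timeevaluate` entry, matching the phase the UI would highlight.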

Correlator samples time metrics on every Nth event (default N = 10) and scales the recorded duration so counters stay representative. Admins can set [correlator] time_metrics_sample_interval in the Correlator config.
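To illustrate why sampled counters stay representative, the sketch below times only every Nth call and multiplies the measured duration by N. It shows the general technique only; the class name and structure are hypothetical and do not reflect the Correlator's actual implementation.

```python
import time


class SampledTimer:
    """Time every Nth call and scale the result by N (illustrative only)."""

    def __init__(self, sample_interval: int = 10):
        self.sample_interval = sample_interval
        self.seen = 0        # events processed so far
        self.total_s = 0.0   # estimated cumulative time in seconds

    def run(self, func, event, clock=time.perf_counter):
        self.seen += 1
        if self.seen % self.sample_interval != 0:
            # Unsampled event: process it without any timing overhead.
            return func(event)
        start = clock()
        result = func(event)
        # One measured event stands in for sample_interval events.
        self.total_s += (clock() - start) * self.sample_interval
        return result
```

A larger interval lowers measurement overhead but makes the estimate noisier for rules that see few events.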


Kafka consumer lag in Grafana (which instance?)

Each Correlator instance consumes from Kafka with its own consumer group. Lag shows whether that instance processes events fast enough.

Typical group_id values:

  • Set in config: [pipeline:CorrelatorsPipeline:KafkaSource] group_id=... (see Correlator configuration)
  • Or auto-generated: lmio_correlator_{tenant}_{groups} (optional sharding suffix)

In Grafana (or Prometheus), watch:

  • Metric: kafka.consumer_group.lag
  • Filter: group =~ /^lmio_correlator.*/, then pick the specific group whose lag is growing

Lag must not grow continuously. A rising lag on one group_id means that Correlator instance is overloaded or its rules are too slow collectively.
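"Growing continuously" can be checked by fitting a trend to recent lag samples instead of reacting to a single spike. The sketch below assumes the lag values for one group_id have already been exported as an evenly spaced list; the function names and the `min_points` and `min_slope` defaults are illustrative assumptions.

```python
def lag_slope(samples: list) -> float:
    """Least-squares slope of lag over evenly spaced samples."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def lag_is_growing(samples: list, min_points: int = 5, min_slope: float = 0.0) -> bool:
    """True when the lag trend is upward across the sampled period."""
    if len(samples) < min_points:
        return False  # too little data to call it a trend
    return lag_slope(samples) > min_slope
```

A spiky but stable series such as `[300, 120, 280, 90, 310, 100]` is not flagged, while `[100, 180, 260, 400, 520, 700]` is.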

Multiple Correlator instances

Large deployments split rule groups across several Correlator instances (different [declarations] groups and often different group_id). Optimize rules on the instance whose lag is high in Grafana, not on an instance that is healthy.

Map instance → rules:

  • [declarations] groups / model correlator.groups lists Library folders that instance loads (for example /Correlations/Firewall/).
  • The same paths appear on the Detections screen tree.
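The mapping from an instance to its rules is a prefix match on Library paths. A minimal sketch, assuming the instance's groups and the rule paths are available as plain strings (the function name and the example file names are hypothetical):

```python
def rules_for_instance(groups: list, rule_paths: list) -> list:
    """Return the rule paths located under any of the instance's Library folders."""
    # Normalize to a trailing slash so /Correlations/Firewall does not
    # also match /Correlations/FirewallExtra/.
    prefixes = tuple(g if g.endswith("/") else g + "/" for g in groups)
    return [path for path in rule_paths if path.startswith(prefixes)]


paths = [
    "/Correlations/Firewall/port-scan.yaml",
    "/Correlations/Windows/logon.yaml",
    "/Correlations/FirewallExtra/other.yaml",
]
firewall_rules = rules_for_instance(["/Correlations/Firewall/"], paths)
```

Here `firewall_rules` contains only the port-scan rule, which is the subset to inspect on the Detections screen when that instance is lagging.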

Per-rule metrics on Detections do not include Kafka lag.


Typical causes of slow rules

1. Substring search on message instead of structured fields

Avoid filtering on the raw log line when parsed fields exist:

# Slow: scans the full message text on every candidate event
predicate:
  - !IN
    what: "failed password"
    where: !ITEM EVENT message

Prefer equality on normalized ECS / schema fields populated by Parsec:

# Fast: direct comparison on parsed fields, matches the parser output
predicate:
  - !EQ
    - !ITEM EVENT event.action
    - "login-failed"
  - !EQ
    - !ITEM EVENT event.outcome
    - "failure"

If the data source does not expose a suitable field yet, extend the parser mapping first, then write the detection against the new field. See Predicates and Parsing rules.

2. Predicate too broad

  • Missing event.dataset, event.category, observer.type, or logsource filters.
  • !OR with many branches instead of a single !EQ on event.code or rule.id.
  • !IN with where: !EVENT only to test field presence; combine with narrow value checks.

Every event that enters the predicate costs CPU even on miss. Use the bell / arrow counts: very high predicate hit with few triggers may be normal; very high hits with high P time means the filter is too loose.

3. Heavy window evaluate / analyze

  • Short span with high event rate → large in-memory windows.
  • High-cardinality evaluate.dimension (e.g. free-text fields).
  • Expensive analyze.test (complex SPLang, many lookups).

4. Lookups in predicate

Each lookup call adds latency. Move lookups to trigger when possible, or pre-enrich in Parsec.


Practical workflow

  1. Grafana: Find a Correlator consumer group with growing lag (kafka.consumer_group.lag, group =~ /^lmio_correlator.*/). Note the group_id and which instance / declaration groups it belongs to.
  2. LogMan.io → Detections: Open the rule tree for that tenant. Focus on rules under the same Library folders the lagging instance loads.
  3. Find rules with Total time ≥ 5 s (and check which phase P/E/A/T is highlighted).
  4. Edit those rules in the Library: tighten predicate, avoid message substring search, fix window/analyze cost (see Typical causes of slow rules).
  5. Compare the predicate hit and trigger in counts on Detections to confirm the filter is not too broad.
  6. After deploy, wait a few minutes: re-check Total time in Detections and lag in Grafana for that group_id.

If lag is healthy on all instances but some rules still show high Total time, optimization is still worthwhile (the CPU is being wasted), but it is not urgent because no backlog is building.