Monitoring detection performance

Correlation performance is checked in two places, for two different questions:

| Where | What you check | Question answered |
| --- | --- | --- |
| Grafana | Kafka consumer lag per Correlator instance (`group_id`) | Which instance is falling behind? |
| LogMan.io UI → Detections (`/detection`) | Time metrics per rule (`timetotal`, P/E/A/T, …) | Which rules on that instance are expensive? |

Do not use Grafana to tune individual rules by time. The `correlator.time` metric is intended for operators and capacity planning; analysts optimize rules using Total time on the Detections screen.

Time metrics in the UI are produced by LogMan.io Correlator, aggregated by LogMan.io TRex, and refreshed about every 60 seconds.

See Correlator, Correlator metrics, and Prometheus monitoring.

Time metrics on Detections (which rules?)

Open Detections in the web app. The tree lists correlation rules from the Library (by folder and YAML file). Each rule row can show:

| UI label | Metric key | Meaning |
| --- | --- | --- |
| Total time (stopwatch) | `timetotal` | Cumulative CPU time spent in the rule (all phases) |
| P | `timepredicate` | Time in the `predicate` filter |
| E | `timeevaluate` | Time in `evaluate` (window rules) |
| A | `timeanalyze` | Time in `analyze` (window rules) |
| T | `timetrigger` | Time in `trigger` |
| Bell icon | `triggerin` | Trigger executions in the last hour |
| Arrow icon | `predicatehit` | Predicate hits in the last hour |

The phase with the highest time is highlighted (warning color). Click a rule row with metrics to open Discover on complex events for that rule (last hour).

Folder rows can show aggregated metrics for all rules under that path.

Rule types

The P / E / A / T breakdown is most meaningful for window correlation rules. Match, sigma, list, and other correlator types still report Total time and the predicate/trigger counters, but may not use every phase.

How to read Total time

Total time is not the latency of a single event. It is the sum of the processing time the Correlator spent on that rule in the metrics window (reported in seconds, to three decimal places).

| Total time | Guidance |
| --- | --- |
| < 1 s | Usually fine |
| 1–5 s | Worth watching; check predicate and event volume |
| ≥ 5 s | High: treat as a performance problem for that rule |

A high Total time with a high predicate hit count often means the rule runs on too many events. A high Total time with low hits points to an expensive predicate or analyze step on the events that do match.
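The guidance bands and the two heuristics above can be sketched as a small triage helper. This is only an illustration, not part of LogMan.io: the function names are made up, and the `high_hits` cut-off is a hypothetical value, because the text does not define what counts as a "high" predicate hit count.

```python
def classify_total_time(seconds: float) -> str:
    """Return the guidance band for a rule's Total time, in seconds."""
    if seconds < 1.0:
        return "usually fine"
    if seconds < 5.0:
        return "worth watching"
    return "high"


def likely_cause(total_time_s: float, predicate_hits: int, high_hits: int = 10_000) -> str:
    """Suggest where to look first; high_hits is a hypothetical cut-off."""
    if classify_total_time(total_time_s) != "high":
        return "no urgent action"
    if predicate_hits >= high_hits:
        # High time together with many hits: the rule sees too many events.
        return "rule runs on too many events"
    # High time with few hits: the matching events are expensive to process.
    return "expensive predicate or analyze step"
```

For example, `likely_cause(6.2, 50_000)` points at event volume, while `likely_cause(6.2, 12)` points at the cost of the predicate or analyze step.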

Compare phases:

P dominates → simplify predicate, use structured fields, avoid scanning message.
E or A dominates → narrow the window (predicate, logsource), reduce cardinality in evaluate.dimension, simplify analyze.test.
T dominates → review trigger actions (templates, lookups, output volume).
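The phase comparison above amounts to picking the largest of the four phase timers. A minimal sketch, assuming the per-rule metrics are available as a plain dictionary keyed by the metric keys from the table (how you obtain that dictionary is not covered here, and the helper name is hypothetical):

```python
# UI label -> metric key, as listed in the time metrics table.
PHASE_KEYS = {
    "P": "timepredicate",
    "E": "timeevaluate",
    "A": "timeanalyze",
    "T": "timetrigger",
}

# Short reminders taken from the guidance above.
ADVICE = {
    "P": "simplify predicate, use structured fields, avoid scanning message",
    "E": "narrow the window, reduce cardinality in evaluate.dimension",
    "A": "narrow the window, simplify analyze.test",
    "T": "review trigger actions (templates, lookups, output volume)",
}


def dominant_phase(metrics: dict) -> str:
    """Return the UI label (P, E, A or T) of the phase with the highest time.

    Phases missing from the dictionary count as 0.0 seconds.
    """
    return max(PHASE_KEYS, key=lambda label: metrics.get(PHASE_KEYS[label], 0.0))
```

Calling `ADVICE[dominant_phase(rule_metrics)]` then gives the matching hint for one rule.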

Correlator samples time metrics on every Nth event (default N = 10) and scales the recorded duration so counters stay representative. Admins can set [correlator] time_metrics_sample_interval in the Correlator config.
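To make the sampling idea concrete, here is a minimal sketch of a sampled timer. It assumes that "scaling" means multiplying each measured duration by N; the actual Correlator implementation may differ, and the class and parameter names are hypothetical.

```python
import time


class SampledTimer:
    """Time only every Nth call and scale the result to estimate the total."""

    def __init__(self, sample_interval: int = 10, clock=time.perf_counter):
        self.sample_interval = sample_interval
        self.clock = clock          # injectable so the timer can be tested
        self.seen = 0               # number of calls observed so far
        self.total_seconds = 0.0    # scaled estimate of the time spent

    def run(self, func, *args):
        self.seen += 1
        if self.seen % self.sample_interval != 0:
            # Not a sampled call: run without any timing overhead.
            return func(*args)
        start = self.clock()
        result = func(*args)
        elapsed = self.clock() - start
        # One measurement stands in for sample_interval calls.
        self.total_seconds += elapsed * self.sample_interval
        return result
```

The trade-off is the usual one for sampling: a larger interval lowers the measurement overhead but makes the estimate noisier for rules that see few events.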

Kafka consumer lag in Grafana (which instance?)

Each Correlator instance consumes from Kafka with its own consumer group. Lag shows whether that instance processes events fast enough.

Typical group_id values:

Set in config: [pipeline:CorrelatorsPipeline:KafkaSource] group_id=... (see Correlator configuration)
Or auto-generated: lmio_correlator_{tenant}_{groups} (optional sharding suffix)

In Grafana (or Prometheus), watch:

Metric: kafka.consumer_group.lag
Filter: group =~ /^lmio_correlator.*/, then pick the specific group whose lag is growing

Lag must not grow continuously. A rising lag on one group_id means that Correlator instance is overloaded or its rules are too slow collectively.
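The "growing continuously" check can be sketched offline on lag samples exported from your monitoring stack. This example assumes you already have a list of lag readings per consumer group, ordered by time; the function names, the `min_samples` default, and the group names in the usage note are hypothetical.

```python
import re

# Same pattern as the Grafana filter above.
CORRELATOR_GROUP = re.compile(r"^lmio_correlator.*")


def is_growing(lag_samples, min_samples: int = 5) -> bool:
    """True if every reading is higher than the one before it."""
    if len(lag_samples) < min_samples:
        return False  # too few points to call it a trend
    return all(later > earlier for earlier, later in zip(lag_samples, lag_samples[1:]))


def lagging_groups(lag_by_group: dict) -> list:
    """Return Correlator consumer groups whose lag grows continuously."""
    return sorted(
        group
        for group, samples in lag_by_group.items()
        if CORRELATOR_GROUP.match(group) and is_growing(samples)
    )
```

Given readings such as `{"lmio_correlator_acme_firewall": [100, 250, 400, 900, 1500], "lmio_correlator_acme_auth": [120, 80, 130, 90, 100]}`, only the first group is reported. A strict "every sample higher" test is deliberately conservative; a real alert would more likely compare averages over a longer interval.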

Multiple Correlator instances

Large deployments split rule groups across several Correlator instances (different [declarations] groups and often different group_id). Optimize rules on the instance whose lag is high in Grafana, not on an instance that is healthy.

Map instance → rules:

[declarations] groups / model correlator.groups lists the Library folders that the instance loads (for example /Correlations/Firewall/).
The same paths appear on the Detections screen tree.
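The instance → rules mapping above is a path-prefix match. A minimal sketch, assuming you have the instance's declaration groups and a list of rule paths from the Library as plain strings (the helper name and the example file names are hypothetical):

```python
def rules_for_instance(declaration_groups, rule_paths):
    """Return the rule paths that live under any of the instance's folders."""
    # Normalize folders to end with "/" so that /Correlations/Firewall
    # does not also match /Correlations/FirewallExtra/...
    prefixes = tuple(
        group if group.endswith("/") else group + "/"
        for group in declaration_groups
    )
    return [path for path in rule_paths if path.startswith(prefixes)]
```

For example, with groups `["/Correlations/Firewall/"]`, a rule at `/Correlations/Firewall/port-scan.yaml` is selected and one at `/Correlations/Auth/brute-force.yaml` is not.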

Per-rule metrics on Detections do not include Kafka lag.

Typical causes of slow rules

1. Substring search on `message` instead of structured fields

Avoid filtering on the raw log line when parsed fields exist:

# Slow: scans the full message text on every candidate event
predicate:
  - !IN
    what: "failed password"
    where: !ITEM EVENT message

Prefer equality on normalized ECS / schema fields populated by Parsec:

# Fast: index-friendly, matches the parser output
predicate:
  - !EQ
    - !ITEM EVENT event.action
    - "login-failed"
  - !EQ
    - !ITEM EVENT event.outcome
    - "failure"

If the data source does not expose a suitable field yet, extend the parser mapping first, then write the detection against the new field. See Predicates and Parsing rules.

2. Predicate too broad

Missing event.dataset, event.category, observer.type, or logsource filters.
!OR with many branches instead of a single !EQ on event.code or rule.id.
!IN with where: !EVENT only to test field presence; combine with narrow value checks.

Every event that reaches the predicate costs CPU, even when it does not match. Use the bell (trigger) and arrow (predicate hit) counts: a very high predicate hit count with few triggers may be normal, but a very high hit count together with high P time means the filter is too loose.

3. Heavy window evaluate / analyze

Short span with high event rate → large in-memory windows.
High-cardinality evaluate.dimension (e.g. free-text fields).
Expensive analyze.test (complex SPLang, many lookups).

4. Lookups in predicate

Each lookup call adds latency. Move lookups to trigger when possible, or pre-enrich in Parsec.

Practical workflow

Grafana: Find a Correlator consumer group with growing lag (kafka.consumer_group.lag, group =~ /^lmio_correlator.*/). Note the group_id and which instance / declaration groups it belongs to.
LogMan.io → Detections: Open the rule tree for that tenant. Focus on rules under the same Library folders the lagging instance loads.
Find rules with Total time ≥ 5 s (and check which phase P/E/A/T is highlighted).
Edit those rules in the Library: tighten predicate, avoid message substring search, fix window/analyze cost (see Typical causes of slow rules above).
Compare predicate hits (`predicatehit`) with trigger executions (`triggerin`) on Detections to confirm the filter is not too broad.
After deploy, wait a few minutes: re-check Total time in Detections and lag in Grafana for that group_id.

If lag is healthy on all instances but some rules still show high Total time, optimization is still worthwhile (the CPU is wasted), but it is not urgent because no backlog is building.

What is a detection rule?
How to write a window correlation rule
Predicates
Correlator metrics
Pipeline metrics (throughput, duty cycle)
Correlator configuration (Kafka group_id, [correlator] options)