Monitoring detection performance¶
Correlation performance is checked in two places, for two different questions:
| Where | What you check | Question answered |
|---|---|---|
| Grafana | Kafka consumer lag per Correlator instance (`group_id`) | Which instance is falling behind? |
| LogMan.io UI → Detections (`/detection`) | Time metrics per rule (`timetotal`, P/E/A/T, …) | Which rules on that instance are expensive? |
Do not use Grafana to tune individual rules by time. correlator.time is for operators and capacity planning. Analysts optimize rules using Total time on the Detections screen.
Time metrics in the UI come from LogMan.io Correlator, aggregated by LogMan.io TRex, refreshed about every 60 seconds.
See Correlator, Correlator metrics, and Prometheus monitoring.
Time metrics on Detections (which rules?)¶
Open Detections in the web app. The tree lists correlation rules from the Library (by folder and YAML file). Each rule row can show:
| UI label | Metric key | Meaning |
|---|---|---|
| Total time (stopwatch) | `timetotal` | Cumulative CPU time spent in the rule (all phases) |
| P | `timepredicate` | Time in the predicate filter |
| E | `timeevaluate` | Time in evaluate (window rules) |
| A | `timeanalyze` | Time in analyze (window rules) |
| T | `timetrigger` | Time in trigger |
| Bell icon | `triggerin` | Trigger executions in the last hour |
| Arrow icon | `predicatehit` | Predicate hits in the last hour |
The phase with the highest time is highlighted (warning color). Click a rule row with metrics to open Discover on complex events for that rule (last hour).
Folder rows can show aggregated metrics for all rules under that path.
Rule types
P / E / A / T breakdown is most meaningful for window correlation rules. Match, sigma, list, and other correlator types still report Total time and predicate/trigger counters, but may not use every phase.
How to read Total time¶
Total time is not the latency of a single event. It is the sum of the processing time the Correlator spent on that rule within the metrics window (reported in seconds, to three decimal places).
| Total time | Guidance |
|---|---|
| < 1 s | Usually fine |
| 1–5 s | Worth watching; check predicate and event volume |
| ≥ 5 s | High: treat as a performance problem for that rule |
A high Total time with a high predicate hit count often means the rule runs on too many events. A high Total time with low hits points to an expensive predicate or analyze step on the events that do match.
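The guidance bands above can be written as a small helper, for example when triaging metric values copied from the Detections screen. This is a minimal sketch: the function name and the returned labels are hypothetical, and only the 1 s and 5 s thresholds come from the table.

```python
def classify_total_time(seconds: float) -> str:
    """Map a rule's Total time (in seconds) to the guidance bands from the table."""
    if seconds < 0:
        raise ValueError("Total time cannot be negative")
    if seconds < 1.0:
        return "usually fine"
    if seconds < 5.0:
        return "worth watching"
    # 5 s or more: treat as a performance problem for that rule.
    return "high"


for value in (0.250, 3.2, 5.0):
    print(f"{value:.3f} s -> {classify_total_time(value)}")
```

The boundaries are half-open so that exactly 5 s already counts as high, matching the "≥ 5 s" row.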
Compare phases:
- P dominates → simplify `predicate`, use structured fields, avoid scanning `message`.
- E or A dominates → narrow the window (`predicate`, `logsource`), reduce cardinality in `evaluate.dimension`, simplify `analyze.test`.
- T dominates → review trigger actions (templates, lookups, output volume).
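The phase comparison can be illustrated in a few lines. The sketch below assumes the per-rule metrics are available as a plain dictionary keyed by the metric keys listed earlier (`timepredicate`, `timeevaluate`, `timeanalyze`, `timetrigger`). The function name and the sample values are hypothetical, and this is not how the UI itself picks the highlighted phase.

```python
from typing import Mapping, Optional

# UI phase label -> metric key, as listed in the Detections table.
PHASE_KEYS = {
    "P": "timepredicate",
    "E": "timeevaluate",
    "A": "timeanalyze",
    "T": "timetrigger",
}


def dominant_phase(metrics: Mapping[str, float]) -> Optional[str]:
    """Return the phase label with the highest time, or None if no phase time is reported."""
    times = {label: metrics.get(key, 0.0) for label, key in PHASE_KEYS.items()}
    label = max(times, key=times.get)
    return label if times[label] > 0 else None


# Hypothetical metrics for one window rule.
rule_metrics = {
    "timetotal": 6.2,
    "timepredicate": 0.4,
    "timeevaluate": 4.9,
    "timeanalyze": 0.8,
    "timetrigger": 0.1,
}
print(dominant_phase(rule_metrics))
```

Missing keys default to zero, which fits rule types that do not use every phase.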
Correlator samples time metrics on every Nth event (default N = 10) and scales the recorded duration so counters stay representative. Admins can set [correlator] time_metrics_sample_interval in the Correlator config.
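One plausible reading of "samples every Nth event and scales the recorded duration" is that only every Nth event is timed and its duration is multiplied by N. The sketch below illustrates that idea under this assumption. It is not the Correlator's actual implementation, and `sampled_total` is a hypothetical name. Durations are passed in as a list so the example stays deterministic.

```python
def sampled_total(durations, sample_interval=10):
    """Estimate total processing time by timing every Nth event and scaling by N."""
    if sample_interval < 1:
        raise ValueError("sample_interval must be >= 1")
    total = 0.0
    for index, duration in enumerate(durations, start=1):
        if index % sample_interval == 0:
            # Only this event is "timed"; it stands in for N events.
            total += duration * sample_interval
    return total


# 1000 events at 2 ms each.
durations = [0.002] * 1000
print(f"true total: {sum(durations):.3f} s")
print(f"estimate:   {sampled_total(durations):.3f} s")
```

With uniform per-event cost the estimate matches the true total. With uneven cost it is an approximation whose accuracy improves as more events are sampled.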
Kafka consumer lag in Grafana (which instance?)¶
Each Correlator instance consumes Kafka with its own consumer group. Lag shows whether that instance processes events fast enough.
Typical group_id values:
- Set in config: `[pipeline:CorrelatorsPipeline:KafkaSource] group_id=...` (see Correlator configuration)
- Or auto-generated: `lmio_correlator_{tenant}_{groups}` (optional sharding suffix)
In Grafana (or Prometheus), watch:
- Metric: `kafka.consumer_group.lag`
- Filter: `group =~ /^lmio_correlator.*/`: pick the specific group that is growing
Lag must not grow continuously. A rising lag on one group_id means that Correlator instance is overloaded or its rules are too slow collectively.
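"Growing continuously" can be made concrete with a trend check over a series of lag samples. The sketch below assumes the lag values per consumer group have already been exported from Grafana or Prometheus into a dictionary. The group names, the sample values, and the `min_slope` threshold are hypothetical. It filters groups with the same `^lmio_correlator` pattern and flags those whose least-squares slope is positive.

```python
import re

CORRELATOR_GROUP = re.compile(r"^lmio_correlator.*")


def lag_slope(samples):
    """Least-squares slope of lag over the sample index (lag change per sample step)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def growing_groups(lag_by_group, min_slope=0.0):
    """Return Correlator consumer groups whose lag trends upward."""
    return sorted(
        group
        for group, samples in lag_by_group.items()
        if CORRELATOR_GROUP.match(group) and lag_slope(samples) > min_slope
    )


# Hypothetical lag samples taken at equal intervals.
lag_by_group = {
    "lmio_correlator_acme_firewall": [100, 400, 900, 1500, 2300],
    "lmio_correlator_acme_windows": [120, 80, 130, 90, 110],
    "lmio_parsec_acme": [10, 500, 900, 2000, 4000],
}
print(growing_groups(lag_by_group, min_slope=50))
```

A slope threshold is used instead of "every sample higher than the last" so that lag which fluctuates around a stable level is not reported as growing.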
Multiple Correlator instances
Large deployments split rule groups across several Correlator instances (different [declarations] groups and often different group_id). Optimize rules on the instance whose lag is high in Grafana, not on an instance that is healthy.
Map instance → rules:
- `[declarations] groups` / model `correlator.groups` lists Library folders that instance loads (for example `/Correlations/Firewall/`).
- The same paths appear on the Detections screen tree.
Per-rule metrics on Detections do not include Kafka lag.
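Mapping an instance to its rules is a path-prefix match between the folders the instance loads and the rule paths shown in the Detections tree. A minimal sketch, assuming both are available as plain strings; the function name and the rule file names are hypothetical:

```python
def rules_for_instance(rule_paths, group_folders):
    """Select the rule paths that live under any Library folder the instance loads."""
    # Normalize to a trailing slash so "/Correlations/Firewall" does not
    # also match "/Correlations/FirewallOld/...".
    prefixes = tuple(folder.rstrip("/") + "/" for folder in group_folders)
    return [path for path in rule_paths if path.startswith(prefixes)]


rule_paths = [
    "/Correlations/Firewall/port-scan.yaml",
    "/Correlations/FirewallOld/legacy.yaml",
    "/Correlations/Windows/bruteforce.yaml",
]
print(rules_for_instance(rule_paths, ["/Correlations/Firewall/"]))
```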
Typical causes of slow rules¶
1. Substring search on message instead of structured fields¶
Avoid filtering on the raw log line when parsed fields exist:
```yaml
# Slow: scans the full message text on every candidate event
predicate:
  - !IN
    what: "failed password"
    where: !ITEM EVENT message
```
Prefer equality on normalized ECS / schema fields populated by Parsec:
```yaml
# Fast: index-friendly, matches the parser output
predicate:
  - !EQ
    - !ITEM EVENT event.action
    - "login-failed"
  - !EQ
    - !ITEM EVENT event.outcome
    - "failure"
```
If the data source does not expose a suitable field yet, extend the parser mapping first, then write the detection against the new field. See Predicates and Parsing rules.
2. Predicate too broad¶
- Missing `event.dataset`, `event.category`, `observer.type`, or `logsource` filters.
- `!OR` with many branches instead of a single `!EQ` on `event.code` or `rule.id`.
- `!IN` with `where: !EVENT` only to test field presence; combine with narrow value checks.
Every event that enters the predicate costs CPU even on miss. Use the bell / arrow counts: very high predicate hit with few triggers may be normal; very high hits with high P time means the filter is too loose.
3. Heavy window evaluate / analyze¶
- Short `span` with high event rate → large in-memory windows.
- High-cardinality `evaluate.dimension` (e.g. free-text fields).
- Expensive `analyze.test` (complex SPLang, many lookups).
4. Lookups in predicate¶
Each lookup call adds latency. Move lookups to trigger when possible, or pre-enrich in Parsec.
Practical workflow¶
- Grafana: Find a Correlator consumer group with growing lag (`kafka.consumer_group.lag`, `group =~ /^lmio_correlator.*/`). Note the `group_id` and which instance / declaration groups it belongs to.
- LogMan.io → Detections: Open the rule tree for that tenant. Focus on rules under the same Library folders the lagging instance loads.
- Find rules with Total time ≥ 5 s (and check which phase P/E/A/T is highlighted).
- Edit those rules in the Library: tighten `predicate`, avoid `message` substring search, fix window/analyze cost (see Typical causes of slow rules).
- On Detections, compare predicate hit (arrow icon) with trigger in (bell icon) to confirm the filter is not too broad.
- After deploy, wait a few minutes, then re-check Total time in Detections and lag in Grafana for that `group_id`.
If lag is healthy on all instances but some rules still show high Total time, optimization is still worthwhile (wasted CPU), but it is not urgent for backlog.
Related documentation¶
- What is a detection rule?
- How to write a window correlation rule
- Predicates
- Correlator metrics
- Pipeline metrics (throughput, duty cycle)
- Correlator configuration (Kafka `group_id`, `[correlator]` options)