Prophylactic Check¶
Introduction¶
Prophylactic checks MUST be performed periodically (weekly, monthly, etc.), ideally on the same day of the week and at a consistent time. It is crucial to consider variations in the quantity and frequency of incoming events, which can fluctuate based on the day of the week, working hours, and bank holidays.
The recommended periodicity of the prophylactic checks:
- 1x per week during the "hypercare" period; hypercare is the stabilization period after the initial installation of the product, lasting roughly 3 months.
- 1x per month after the hypercare period ends.
The results of prophylactic checks should be documented in two separate reports:
- Vendor Report – Maintained for internal tracking in the Vendor (TeskaLabs) system.
- Client Report – Shared with the end user via their support Slack channel or another support channel.
Note
Explain and describe findings clearly, and make sure that severe issues are already being addressed internally. If a resolution is underway, always state this in the report.
Support Re-distribution¶
A list of individuals involved in specific projects can be found in the internal documentation. If an issue arises, escalate it appropriately based on the client, partner, or customer involved.
Prophylactic Check Procedure¶
Prerequisites¶
- Access (HTTPS, SSH) to the respective TeskaLabs LogMan.io installation.
- Ensure that all tenants available to you are reviewed during these checks.
TeskaLabs LogMan.io Functionalities¶
Navigation and Functionality Check¶
- Review each assigned tenant by checking all components in the sidebar:
- Discover: Verify log accessibility, search performance, and query execution. Ensure logs are correctly indexed and available.
- Dashboards: Check for correct data visualization, updated widgets, and proper filtering. Ensure graphs and charts reflect expected metrics.
- Reports: Validate the generation of reports, their formatting, and accuracy of data. Ensure scheduled reports are delivered correctly.
- Export: Test data extraction and ensure exported files are correctly formatted and contain expected content.
- Archive: Verify archived logs are retrievable and stored as expected.
- Logsources:
- Collectors: Ensure all collectors are active, receiving data, and properly configured.
- Event Lanes: Validate the correct classification of events and confirm no anomalies in processing.
- Baselines: Check baseline metrics to ensure they align with expected thresholds and review changes.
- Alerts: Confirm that alerting mechanisms are functional, test alert triggers, and review escalations.
- Tools: Ensure that built-in tools (e.g., Grafana, Kibana) are working as expected.
- Lookups: Validate lookup tables for accuracy and ensure they are correctly referenced in event processing.
- Library: Confirm availability and correct versioning of shared resources and content.
- Maintenance:
- Configuration: Ensure system settings and configurations are properly maintained and updated.
- Services: Verify that backend services are operational and running smoothly.
- Auth & Roles:
- Credentials: Check the available credentials and keep them updated.
- Tenants: Confirm tenant access and ensure proper isolation of data.
- Sessions: Review session logs for anomalies or unauthorized access attempts.
- Roles: Ensure role-based access controls are properly enforced.
- Resources: Validate the availability of shared system resources.
- Clients: Ensure client configurations are correctly maintained and up to date.
Note: If you are not a superuser, you may not see every section mentioned above.
ASAB-IRIS - Template Testing¶
- Test sending an email and a Slack message through the library.
- Any issues in this section should be reported internally.
Log Sources Monitoring¶
Log Time Zones¶
- Where to check: TeskaLabs LogMan.io Discover screen
- Check for logs with `@timestamp` in the future (now+2H or more); a query sketch follows this list.
- Issue Reporting:
- Incorrect time zones should be reported internally.
- If the issue is due to incorrect logging device settings, report to the client’s support Slack channel.
- Analyze the source (`host.hostname`, `lmio.source`, or IP address) and include this in the prophylaxis report.
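As a cross-check, the same condition can be queried directly in Elasticsearch. This is a minimal sketch; the endpoint, credentials, and the index pattern `lmio-*-events-*` are assumptions to adjust per installation:

```python
# Minimal sketch: list events whose @timestamp lies in the future (now+2h or more)
# and aggregate them by source, so the offending device can be named in the report.
# The endpoint, credentials, and index pattern below are assumptions.
import requests

ES_URL = "https://localhost:9200"     # assumed Elasticsearch endpoint
INDEX = "lmio-*-events-*"             # assumed index pattern, adjust per tenant
AUTH = ("elastic", "changeme")        # assumed credentials

query = {
    "size": 10,
    "query": {"range": {"@timestamp": {"gte": "now+2h"}}},
    "aggs": {
        "by_host": {"terms": {"field": "host.hostname", "size": 20}},
        "by_source": {"terms": {"field": "lmio.source", "size": 20}},
    },
}

r = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, auth=AUTH, verify=False)
r.raise_for_status()
result = r.json()
print("Future-dated events:", result["hits"]["total"]["value"])
for bucket in result["aggregations"]["by_host"]["buckets"]:
    print(f"  {bucket['key']}: {bucket['doc_count']}")
```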
Log Sources¶
- Where to check: Discover screen – Event Lanes, Event lane screen, and Baseliner
- Goal: Ensure every connected log source is currently active; investigate all outages and anomalies in logging.
- Issue Reporting:
- Internally and to the client/partner support Slack channel.
Other Events¶
- Where to check: TeskaLabs LogMan.io – `lmio-others-events` index on the Discover screen
- Common sources of error logs:
- Depositor logs
- Unstructured logs
- Multiline and fragmented logs
- Issue Reporting: Internally to the parsec team.
System Logs¶
- Where to check: TeskaLabs LogMan.io - System tenant, index Events & Others
- Issue Reporting:
- Various log types may appear here; focus on error and warning logs.
- Report findings internally or to the client support Slack channel if necessary.
Baseliner¶
- Where to check: TeskaLabs LogMan.io Baseliner screen
- Also check that the redirection from the Discover screen to the Baseliner screen works.
- Issue Reporting: If Baseliner is inactive, report internally.
Elasticsearch Monitoring¶
- Where to check: Grafana, dedicated dashboard – ElasticSearch and Kibana Stack Monitoring
- Sample data check for the last 24 hours
- Indicators to Monitor:
- Inactive Nodes → Should be zero
- System Health → Should be green; escalate immediately if yellow or red.
- Unassigned Shards → Should be zero and green; yellow or nonzero requires monitoring and reporting.
- JVM Heap → Monitor heap usage to ensure it remains below 75% to avoid excessive garbage collection, which can slow down query execution. If heap usage frequently exceeds 85%, consider increasing allocated memory or optimizing queries and indexing.
- Assigned ILM → Check that ILM policies are correctly applied to indices, ensuring data is moved according to defined retention and performance strategies. Misconfigured ILM can lead to increased storage costs and degraded search performance.
- Issue Reporting: Escalate severe issues internally.
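Where direct access to the Elasticsearch API is available, the same indicators can be read without opening Grafana or Kibana. A minimal sketch, assuming the endpoint and credentials shown (adjust to the respective installation):

```python
# Minimal sketch of the indicators above, read from the Elasticsearch cluster APIs.
# Endpoint and credentials are assumptions, adjust to the respective installation.
import requests

ES_URL = "https://localhost:9200"     # assumed endpoint
AUTH = ("elastic", "changeme")        # assumed credentials

# Cluster health: status should be "green", unassigned shards should be zero.
health = requests.get(f"{ES_URL}/_cluster/health", auth=AUTH, verify=False).json()
print("status:", health["status"])
print("unassigned shards:", health["unassigned_shards"])

# JVM heap: keep below 75%, investigate if it frequently exceeds 85%.
stats = requests.get(f"{ES_URL}/_nodes/stats/jvm", auth=AUTH, verify=False).json()
for node in stats["nodes"].values():
    heap = node["jvm"]["mem"]["heap_used_percent"]
    note = "" if heap < 75 else "  <-- above 75%, investigate"
    print(f"{node['name']}: JVM heap {heap}%{note}")
```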
System Level Overview¶
- Where to check: Grafana, dedicated dashboard – System Level Overview
- Sample data check for the last 24 hours
- Metrics to Monitor:
- Disk usage: Must not exceed 80% (except for `/boot`, which should not exceed 95%).
- Load: Should not exceed 40%; max load should equal the number of cores.
- CPU: Should not exceed 85% utilization over an extended period.
- IOWait: Should be below 10%; values above 20% indicate significant disk read/write delays.
- RAM usage: Should not exceed 70%; continuous usage above 80% requires investigation.
- Swap: Should be minimal; frequent or high swap usage indicates memory pressure and needs further analysis.
- Issue Reporting: Report internally.
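A minimal sketch for collecting the same metrics locally on a monitored node with Python and psutil (psutil is an assumed dependency, not part of the product); the thresholds match the list above:

```python
# Minimal sketch of the system-level thresholds above, collected locally with psutil.
# Run on the monitored host; psutil is an assumed dependency (pip install psutil).
import os
import psutil

# Disk usage: 80% limit in general, 95% for /boot.
for mount, limit in (("/", 80), ("/boot", 95)):
    usage = psutil.disk_usage(mount)
    print(f"Disk {mount}: {usage.percent:.0f}% used (limit {limit}%)")

# Load: max load should equal the number of cores.
cores = os.cpu_count()
load1, _, _ = os.getloadavg()
print(f"Load (1m): {load1:.1f} on {cores} cores")

# CPU (limit 85%) and IOWait (should stay below 10%).
cpu = psutil.cpu_percent(interval=1)
iowait = psutil.cpu_times_percent(interval=1).iowait
print(f"CPU: {cpu:.0f}%, IOWait: {iowait:.1f}%")

# RAM (limit 70%) and swap (should be minimal).
print(f"RAM: {psutil.virtual_memory().percent:.0f}%, Swap: {psutil.swap_memory().percent:.0f}%")
```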
Kafka Lag Overview¶
- Definition: Lag in this context refers to the delay between when a message is produced to a Kafka topic and when it is consumed by the respective consumer group. A high lag value indicates that consumers are not processing messages quickly enough, leading to potential data processing delays and system inefficiencies.
- Where to check: Grafana, dedicated dashboard – Kafka Lag Overview
- Groups to Monitor:
- lmio parsec
- lmio depositor
- lmio baseliner
- lmio correlator
- Key Metric: Lag value should not increase over time.
- Issue Reporting:
- If lag increases compared to the previous week, report in the internal Slack channel.
- Severe lag increase should be escalated immediately.
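The lag value can also be verified outside Grafana. A minimal sketch using kafka-python; the bootstrap address and the consumer group and topic names are assumptions to verify against the actual deployment:

```python
# Minimal sketch: lag = latest (end) offset minus committed offset, summed per group.
# Bootstrap address, group names, and topic names are assumptions.
from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = "localhost:9092"                 # assumed Kafka address
GROUPS = {
    "lmio-parsec": "received.default",       # hypothetical group -> topic mapping
    "lmio-depositor": "events.default",
}

for group, topic in GROUPS.items():
    consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP, group_id=group)
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)
    lag = sum(end_offsets[tp] - (consumer.committed(tp) or 0) for tp in partitions)
    print(f"{group}: lag={lag}")
    consumer.close()
```

The lag should stay flat or decrease between checks; a value that grows week over week is what triggers the reporting rules above.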
Index Sizing & Lifecycle Monitoring¶
- Where to check: Kibana, Stack Monitoring or Stack Management
- Steps:
- Click on Indices.
- Sort the Data column from largest to smallest.
- Investigate indices larger than 200 GB.
- ILM Check:
- If an index is missing a numeric suffix, it is not connected to ILM.
- Check whether indices are correctly classified as hot/warm/cold.
- Sharding:
- The number of shards should not exceed 500-600 per node, to prevent excessive resource utilization.
- Verify shard allocation across cluster nodes to ensure balanced distribution.
- Issue Reporting: Report internally and escalate immediately.
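The same size, ILM-suffix, and shards-per-node checks can be scripted against the _cat APIs as a cross-check of the Kibana steps above. A minimal sketch; the endpoint and credentials are assumptions:

```python
# Minimal sketch of the index-size, ILM-suffix, and shards-per-node checks via the
# Elasticsearch _cat APIs. Endpoint and credentials are assumptions.
import re
import requests

ES_URL = "https://localhost:9200"     # assumed endpoint
AUTH = ("elastic", "changeme")        # assumed credentials
LIMIT_BYTES = 200 * 1024**3           # 200 GB threshold from the steps above

indices = requests.get(
    f"{ES_URL}/_cat/indices?format=json&bytes=b&s=store.size:desc",
    auth=AUTH, verify=False,
).json()

for idx in indices:
    size = int(idx["store.size"] or 0)
    if size > LIMIT_BYTES:
        print(f"LARGE: {idx['index']} ({size / 1024**3:.0f} GB)")
    if not re.search(r"\d+$", idx["index"]):
        # No numeric suffix usually means the index is not connected to ILM.
        print(f"NO ILM SUFFIX: {idx['index']}")

# Shards per node should stay below roughly 500-600.
for node in requests.get(f"{ES_URL}/_cat/allocation?format=json", auth=AUTH, verify=False).json():
    print(f"{node['node']}: {node['shards']} shards")
```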
Counting EPS¶
- Definition: EPS (events per second) refers to the number of log events received per second. Monitoring EPS helps track the system's ingestion rate, detect anomalies, and ensure the system can handle peak loads efficiently.
- Where to check: LogMan.io UI for the last 7 days
- Metrics to Retrieve:
- MEAN value
- MAX value
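A minimal sketch for deriving the MEAN and MAX values outside the UI, by bucketing the last 7 days into hourly counts in Elasticsearch and dividing by 3,600 seconds; the endpoint, credentials, and index pattern are assumptions:

```python
# Minimal sketch: MEAN and MAX EPS over the last 7 days, derived from hourly counts.
# Endpoint, credentials, and index pattern are assumptions, adjust per tenant.
import requests

ES_URL = "https://localhost:9200"     # assumed endpoint
INDEX = "lmio-default-events-*"       # hypothetical tenant index pattern
AUTH = ("elastic", "changeme")        # assumed credentials

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-7d"}}},
    "aggs": {"per_hour": {"date_histogram": {"field": "@timestamp", "fixed_interval": "1h"}}},
}

r = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, auth=AUTH, verify=False)
r.raise_for_status()
buckets = r.json()["aggregations"]["per_hour"]["buckets"]
eps = [b["doc_count"] / 3600 for b in buckets]

print(f"MEAN EPS: {sum(eps) / len(eps):.1f}")
print(f"MAX EPS (highest hourly average): {max(eps):.1f}")
```

Note that the MAX value derived this way is the highest hourly average; the LogMan.io UI may report peaks at a finer resolution.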