Prophylactic Check¶
A prophylactic check in TeskaLabs LogMan.io is a structured, periodic review of the platform performed on a weekly or monthly basis. Its purpose is to proactively identify issues such as misconfigurations, degraded performance, or log source outages before they affect system stability or data integrity. The process includes reviewing the UI functionality, system metrics, Elasticsearch health, and verifying that all log sources are active and ingesting data correctly.
Introduction¶
Prophylactic checks MUST be performed periodically (weekly, monthly, etc.), ideally on the same day of the week and at a consistent time. It is crucial to consider variations in the quantity and frequency of incoming events, which can fluctuate based on the day of the week, working hours, and bank holidays.
The recommended periodicity of the prophylactic checks:
- 1x week in "hypercare" period; the hypercare period is a stabilization period after the initial installation of the product, lasting ~3 months.
- 1x month thereafter
The results of prophylactic checks should be documented in two separate reports:
- Partner Report – For the partner's internal tracking and escalation, and for communication with TeskaLabs if necessary.
- Customer Report – A simplified summary shared with the customer via their support channel (e.g., Slack, email).
Note
Be mindful to clearly explain and describe findings, and make sure that severe issues are already being addressed internally. If a resolution is underway, always communicate this in the report.
Support Re-distribution¶
A list of individuals involved in specific projects can be found in the internal documentation. If an issue arises, escalate it appropriately based on the client, partner, or customer involved.
Prophylactic Check Procedure¶
Prerequisites¶
- Access (HTTPS, SSH) to the respective TeskaLabs LogMan.io installation.
- Ensure that all tenants available to you are reviewed during these checks.
TeskaLabs LogMan.io Functionalities¶
Navigation and Functionality Check¶
- Review each assigned tenant by checking all components in the sidebar:
- Discover: Verify log accessibility, search performance, and query execution. Ensure logs are correctly indexed and available.
- Dashboards: Check for correct data visualization, updated widgets, and proper filtering. Ensure graphs and charts reflect expected metrics.
- Reports: Validate the generation of reports, their formatting, and accuracy of data. Ensure scheduled reports are delivered correctly.
- Export: Test data extraction and ensure exported files are correctly formatted and contain expected content.
- Archive: Verify archived logs are retrievable and stored as expected.
- Logsources:
- Collectors: Ensure all collectors are active, receiving data, and properly configured.
- Event Lanes: Validate the correct classification of events and confirm no anomalies in processing.
- Baselines: Check baseline metrics to ensure they align with expected thresholds and review changes.
- Alerts: Confirm that alerting mechanisms are functional, test alert triggers, and review escalations.
- Tools: Ensure that built-in tools (e.g., Grafana, Kibana) are working as expected.
- Lookups: Validate lookup tables for accuracy and ensure they are correctly referenced in event processing.
- Library: Confirm availability and correct versioning of shared resources and content.
- Maintenance:
- Configuration: Ensure system settings and configurations are properly maintained and updated.
- Services: Verify that backend services are operational and running smoothly.
- Auth & Roles:
- Credentials: Check the available credentials and keep them updated.
- Tenants: Confirm tenant access and ensure proper isolation of data.
- Sessions: Review session logs for anomalies or unauthorized access attempts.
- Roles: Ensure role-based access controls are properly enforced.
- Resources: Validate the availability of shared system resources.
- Clients: Ensure client configurations are correctly maintained and up to date.
Note: Some sections may be restricted based on user role. If you lack sufficient privileges, escalate access requirements within your organization or contact TeskaLabs.
ASAB-IRIS - Template Testing¶
- Test sending an email and a Slack message through the library.
- Any issues in this section should be reported internally.
Log Sources Monitoring¶
Log Time Zones¶
- Where to check: TeskaLabs LogMan.io Discover screen
- Check for logs with `@timestamp` in the future (now+2H or more); see the query sketch at the end of this subsection.
- Issue Reporting:
- Incorrect time zones should be reported internally.
- If the issue is due to incorrect logging device settings, report to the client’s support Slack channel.
- Analyze the source (`host.hostname`, `lmio.source`, or IP address) and include this in the prophylaxis report.
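The check can also be scripted against Elasticsearch directly. Below is a minimal sketch, assuming Python with the `requests` package, an Elasticsearch endpoint and credentials as shown, and an event index pattern such as `lmio-*-events-*` (all placeholders to be adjusted to the actual installation):

```python
# Minimal sketch: find events with @timestamp at least two hours in the future and
# group them by host.hostname. Endpoint, credentials, and index pattern are assumptions.
import requests

ES_URL = "https://elasticsearch.example.internal:9200"   # assumed endpoint
AUTH = ("elastic", "changeme")                            # assumed credentials
INDEX_PATTERN = "lmio-*-events-*"                         # assumed index pattern

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now+2h"}}},
    "aggs": {
        # If host.hostname is mapped as text, a ".keyword" sub-field may be needed.
        "by_host": {"terms": {"field": "host.hostname", "size": 50}}
    },
}

resp = requests.post(f"{ES_URL}/{INDEX_PATTERN}/_search", json=query, auth=AUTH, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_host"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["doc_count"]} events with a future @timestamp')
```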
Log Sources¶
- Where to check: Discover screen – Event Lanes, Event lane screen, and Baseliner
- Goal: Ensure every connected log source is currently active, and investigate all outages and anomalies in logging.
- Issue Reporting:
- Communicate source-side issues with the customer; escalate LogMan.io-related issues to TeskaLabs.
Other Events¶
- Where to check: TeskaLabs LogMan.io - `lmio-others-events` index on the Discover screen; a count sketch follows at the end of this subsection.
- Common sources of error logs:
- Depositor logs
- Unstructured logs
- Multiline and fragmented logs
- Issue Reporting: Coordinate with TeskaLabs if parsing errors or unhandled formats are discovered.
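For a quick quantitative view, the number of events that landed in this index over the last 24 hours can be counted via the API. A minimal sketch, assuming the same `requests`-based access and the index pattern `lmio-others-events*`:

```python
# Minimal sketch: count documents written to the "others" index in the last 24 hours.
# Endpoint, credentials, and index pattern are assumptions.
import requests

ES_URL = "https://elasticsearch.example.internal:9200"    # assumed endpoint
AUTH = ("elastic", "changeme")                             # assumed credentials

body = {"query": {"range": {"@timestamp": {"gte": "now-24h"}}}}
resp = requests.post(f"{ES_URL}/lmio-others-events*/_count", json=body, auth=AUTH, timeout=30)
resp.raise_for_status()
print("events in lmio-others-events over the last 24h:", resp.json()["count"])
```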
System Logs¶
- Where to check: TeskaLabs LogMan.io - System tenant, index Events & Others
- Issue Reporting:
- Various log types may appear here; focus on error and warning logs.
- Discuss findings with TeskaLabs where necessary and inform the customer of any relevant impact.
Baseliner¶
- Where to check: TeskaLabs LogMan.io Baseliner screen
- Include checking redirection from the Discover screen to the Baseliner screen.
- Issue Reporting: If Baseliner is inactive, notify TeskaLabs.
Elasticsearch Monitoring¶
- Where to check: Grafana, dedicated dashboard – ElasticSearch and Kibana Stack Monitoring
- Sample data check for the last 24 hours
- Indicators to Monitor:
- Inactive Nodes → Should be zero
- System Health → Should be green; escalate immediately if yellow or red.
- Unassigned Shards → Should be zero and green; yellow or nonzero requires monitoring and reporting.
- JVM Heap → Monitor heap usage to ensure it remains below 75% to avoid excessive garbage collection, which can slow down query execution. If heap usage frequently exceeds 85%, consider increasing allocated memory or optimizing queries and indexing.
- Assigned ILM → Check that ILM policies are correctly applied to indices, ensuring data is moved according to defined retention and performance strategies. Misconfigured ILM can lead to increased storage costs and degraded search performance.
- Issue Reporting: Report issues to TeskaLabs if the problem is platform-level; otherwise, address configuration concerns within your team.
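The first three indicators can also be read from the Elasticsearch API as a cross-check of the Grafana dashboard. A minimal sketch, assuming the same endpoint and credentials as above:

```python
# Minimal sketch: cluster health, unassigned shards, and per-node JVM heap usage.
# Endpoint and credentials are assumptions.
import requests

ES_URL = "https://elasticsearch.example.internal:9200"    # assumed endpoint
AUTH = ("elastic", "changeme")                             # assumed credentials

health = requests.get(f"{ES_URL}/_cluster/health", auth=AUTH, timeout=30).json()
print("status:", health["status"])                         # should be "green"
print("unassigned shards:", health["unassigned_shards"])   # should be 0
print("data nodes:", health["number_of_data_nodes"])       # should match the expected node count

stats = requests.get(f"{ES_URL}/_nodes/stats/jvm", auth=AUTH, timeout=30).json()
for node in stats["nodes"].values():
    heap = node["jvm"]["mem"]["heap_used_percent"]
    print(f'{node["name"]}: JVM heap {heap} % (keep below ~75 %)')
```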
System Level Overview¶
- Where to check: Grafana, dedicated dashboard – System Level Overview
- Sample data check for the last 24 hours
- Metrics to Monitor:
- Disk usage: Must not exceed 80% (except for `/boot`, which should not exceed 95%).
- Load: Should not exceed 40% of the maximum; the maximum load corresponds to the number of CPU cores.
- CPU: Should not exceed 85% utilization over an extended period.
- IOWait: Should be below 10%; values above 20% indicate significant disk read/write delays.
- RAM usage: Should not exceed 70%; continuous usage above 80% requires investigation.
- Swap: Should be minimal; frequent or high swap usage indicates memory pressure and needs further analysis.
- Issue Reporting: Address local resource issues within your infrastructure team. For unclear system behavior, contact TeskaLabs.
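These values are read from the Grafana dashboard; for ad-hoc verification directly on a node, the same metrics can be sampled locally. A minimal sketch, assuming a Linux host with the `psutil` package installed:

```python
# Minimal sketch: sample the metrics above locally on one node (Linux, psutil assumed).
# The thresholds in the comments mirror the limits listed above.
import os
import psutil

disk = psutil.disk_usage("/")
print(f"disk /: {disk.percent:.0f} % used (limit 80 %)")

load1, _, _ = os.getloadavg()
print(f"load (1 min): {load1:.1f} of {psutil.cpu_count()} cores")

print(f"CPU: {psutil.cpu_percent(interval=1):.0f} % (limit 85 % sustained)")
print(f"iowait: {psutil.cpu_times_percent(interval=1).iowait:.0f} % (should stay below 10 %)")
print(f"RAM: {psutil.virtual_memory().percent:.0f} % (limit 70 %)")
print(f"swap: {psutil.swap_memory().percent:.0f} % (should stay minimal)")
```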
Kafka Lag Overview¶
- Definition: Lag in this context refers to the delay between when a message is produced to a Kafka topic and when it is consumed by the respective consumer group. A high lag value indicates that consumers are not processing messages quickly enough, leading to potential data processing delays and system inefficiencies.
- Where to check: Grafana, dedicated dashboard – Kafka Lag Overview
- Groups to Monitor: `lmio parsec`, `lmio depositor`, `lmio baseliner`, `lmio correlator`
- Key Metric: Lag value should not increase over time.
- Issue Reporting: Severe lag increase should be escalated immediately. Involve TeskaLabs if there is no clear resolution path.
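Lag for a single consumer group can also be computed outside Grafana as the difference between the latest offset and the committed offset per partition. A minimal sketch with the `kafka-python` package; the broker address, group name, and topic name below are assumptions:

```python
# Minimal sketch: total lag of one consumer group for one topic, computed as
# (latest offset - committed offset) per partition. Broker, group, and topic are assumptions.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="kafka.example.internal:9092",   # assumed broker address
    group_id="lmio-depositor",                          # assumed consumer group name
    enable_auto_commit=False,
)

topic = "received.tenant.events"                        # assumed topic name
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
end_offsets = consumer.end_offsets(partitions)

total_lag = sum(end_offsets[tp] - (consumer.committed(tp) or 0) for tp in partitions)
print(f"total lag for {topic}: {total_lag} messages")   # should not grow between checks
```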
Index Sizing & Lifecycle Monitoring¶
- Where to check: Kibana, Stack monitoring or Stack management
- Steps:
- Click on Indices.
- Sort the Data column from largest to smallest.
- Investigate indices larger than 200 GB.
- ILM Check:
- If an index is missing a numeric suffix, it is not connected to ILM.
- Check whether indices are correctly classified as hot/warm/cold.
- Sharding:
- Sharding should not exceed 500-600 shards per node to prevent excessive resource utilization.
- Verify shard allocation across cluster nodes to ensure balanced distribution.
- Issue Reporting: If excessive growth or ILM misconfigurations are found, coordinate remediation. Contact TeskaLabs if changes at the application level are required.
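A minimal sketch of the sizing and ILM checks, using the `_cat/indices` and `_ilm/explain` APIs with the same assumed endpoint and credentials as above:

```python
# Minimal sketch: list indices above 200 GB and show their ILM state.
# Endpoint and credentials are assumptions.
import requests

ES_URL = "https://elasticsearch.example.internal:9200"    # assumed endpoint
AUTH = ("elastic", "changeme")                             # assumed credentials

indices = requests.get(
    f"{ES_URL}/_cat/indices?format=json&bytes=b&s=store.size:desc",
    auth=AUTH, timeout=30,
).json()

for idx in indices:
    size_gb = int(idx["store.size"] or 0) / 1024 ** 3
    if size_gb <= 200:
        break                                              # list is sorted by size, descending
    print(f'{idx["index"]}: {size_gb:.0f} GB, {idx["pri"]} primary shards, health {idx["health"]}')
    ilm = requests.get(f'{ES_URL}/{idx["index"]}/_ilm/explain', auth=AUTH, timeout=30).json()
    info = ilm["indices"][idx["index"]]
    print("  ILM managed:", info.get("managed"), "| phase:", info.get("phase"))
```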
Counting EPS¶
- Definition: EPS (events per second) refers to the number of log events received per second. Monitoring EPS helps track the system's ingestion rate, detect anomalies, and ensure the system can handle peak loads efficiently.
- Where to check: LogMan.io UI for the last 7 days
- Metrics to Retrieve:
- MEAN value
- MAX value
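The LogMan.io UI is the primary place to read these numbers; as a cross-check, they can be approximated from Elasticsearch using hourly buckets (so the MAX is an hourly average, not a true per-second peak). A minimal sketch, assuming the endpoint, credentials, and index pattern used above:

```python
# Minimal sketch: approximate MEAN and MAX EPS over the last 7 days from hourly buckets.
# Endpoint, credentials, and index pattern are assumptions.
import requests

ES_URL = "https://elasticsearch.example.internal:9200"    # assumed endpoint
AUTH = ("elastic", "changeme")                             # assumed credentials
INDEX_PATTERN = "lmio-*-events-*"                          # assumed index pattern

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-7d"}}},
    "aggs": {"per_hour": {"date_histogram": {"field": "@timestamp", "fixed_interval": "1h"}}},
}
resp = requests.post(f"{ES_URL}/{INDEX_PATTERN}/_search", json=query, auth=AUTH, timeout=60)
resp.raise_for_status()

counts = [b["doc_count"] for b in resp.json()["aggregations"]["per_hour"]["buckets"]]
print(f"MEAN EPS over 7 days: {sum(counts) / (7 * 24 * 3600):.1f}")
print(f"MAX EPS (hourly average): {max(counts) / 3600:.1f}")
```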