Continuity Plan¶
Risk matrix¶
The risk matrix defines the level of risk by weighing the "Likelihood" of an incident occurring against its "Impact". Each category is given a score between 1 and 5. Multiplying the "Likelihood" score by the "Impact" score produces the total risk score.
Likelihood¶
Likelihood | Score |
---|---|
Rare | 1 |
Unlikely | 2 |
Possible | 3 |
Likely | 4 |
Almost certain | 5 |
Impact¶
Impact | Score | Description |
---|---|---|
Insignificant | 1 | The functionality is not impacted, performance is not reduced, downtime is not needed. |
Minor | 2 | The functionality is not impacted, the performance is not reduced, downtime of the impacted cluster node is needed. |
Moderate | 3 | The functionality is not impacted, the performance is reduced, downtime of the impacted cluster node is needed. |
Major | 4 | The functionality is impacted, the performance is significantly reduced, downtime of the cluster is needed. |
Catastrophic | 5 | Total loss of functionality. |
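The scoring rule can be sketched in a few lines of Python. The scales are copied from the two tables above; the function name `risk_score` is illustrative only. The mapping from the numeric score to the named risk levels (low, medium-low, medium-high) is not defined in this section, so the sketch stops at the score.

```python
# Scales copied from the Likelihood and Impact tables.
LIKELIHOOD = {"Rare": 1, "Unlikely": 2, "Possible": 3, "Likely": 4, "Almost certain": 5}
IMPACT = {"Insignificant": 1, "Minor": 2, "Moderate": 3, "Major": 4, "Catastrophic": 5}


def risk_score(likelihood: str, impact: str) -> int:
    """Return the total risk score: likelihood score times impact score."""
    return LIKELIHOOD[likelihood] * IMPACT[impact]


if __name__ == "__main__":
    # Complete system failure: Rare (1) x Catastrophic (5)
    print(risk_score("Rare", "Catastrophic"))
```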
Incident scenarios¶
Complete system failure¶
Impact: Catastrophic (5)
Likelihood: Rare (1)
Risk level: medium-high
Risk mitigation:
- Geographically distributed cluster
- Active use of monitoring and alerting
- Prophylactic maintenance
- Strong cyber-security posture
Recovery:
- Contact support and/or the vendor and agree on the recovery strategy.
- Restore the hardware functionality.
- Restore the system from the backup of the site configuration.
- Restore the data from the offline backup, starting with the most recent data and working back through older data.
Loss of the node in the cluster¶
Impact: Moderate (3)
Likelihood: Unlikely (2)
Risk level: medium-low
Risk mitigation:
- Geographically distributed cluster
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Contact support and/or the vendor and agree on the recovery strategy.
- Restore the hardware functionality.
- Restore the system from the backup of the site configuration.
- Restore the data from the offline backup, starting with the most recent data and working back through older data.
Loss of the fast storage drive in one node of the cluster¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
Fast drives are in a RAID 1 array, so the loss of one drive is non-critical. Replace the failed drive quickly, because a second fast drive failure before the array is rebuilt escalates to "Loss of the node in the cluster".
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely replacement of the failed drive
Recovery:
- Turn off the impacted cluster node.
- Replace the failed fast storage drive as soon as possible.
- Turn on the impacted cluster node.
- Verify that the RAID 1 array has been reconstructed correctly.
Note
Hot swap of the fast storage drive is supported on a specific customer request.
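A minimal sketch of the "verify the array reconstruction" step, assuming the node uses Linux software RAID (md), which this document does not state. With a hardware RAID controller, the vendor's own tooling applies instead. The function `degraded_arrays` is a hypothetical helper that inspects text in the format of `/proc/mdstat` and reports arrays with missing members.

```python
import re

# Matches the member-status field of an md array, e.g. "[2/2] [UU]" or "[2/1] [U_]".
_STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
_HEADER = re.compile(r"^(md\d+)\s*:")


def degraded_arrays(mdstat_text: str) -> list[str]:
    """Return names of md arrays that report fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = _HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        status = _STATUS.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    # Illustrative sample; on a real node, read the text from /proc/mdstat instead.
    sample = (
        "Personalities : [raid1]\n"
        "md0 : active raid1 sdb1[1] sda1[0]\n"
        "      104320 blocks [2/2] [UU]\n"
        "\n"
        "md1 : active raid1 sda2[0]\n"
        "      208640 blocks [2/1] [U_]\n"
    )
    print(degraded_arrays(sample))
```

An empty result after the rebuild finishes indicates that every array reports all of its members as active.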
Fast storage space shortage¶
Impact: Moderate (3)
Likelihood: Possible (3)
Risk level: medium-high
This situation is problematic if it happens on multiple nodes of the cluster simultaneously. Use monitoring tools to identify this situation ahead of escalation.
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Remove unnecessary data from the fast storage space.
- Adjust the life cycle configuration so that the data are moved to slow storage space sooner.
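Detecting a space shortage ahead of escalation comes down to comparing the used fraction of a mount point against a warning threshold. The following is a rough sketch, not the LogMan.io monitoring implementation: the 80 % threshold and the path passed in are assumptions chosen for illustration. The same check applies to slow and system storage.

```python
import shutil


def usage_percent(total: int, used: int) -> float:
    """Return the used share of a volume as a percentage."""
    return 100.0 * used / total


def check_storage(path: str, warn_at: float = 80.0) -> tuple[float, bool]:
    """Return (used percentage, warning flag) for the volume that holds `path`.

    `warn_at` is a hypothetical threshold; tune it to the ingest rate of the node.
    """
    usage = shutil.disk_usage(path)
    pct = usage_percent(usage.total, usage.used)
    return pct, pct >= warn_at


if __name__ == "__main__":
    # "." stands in for the fast storage mount point, which is site specific.
    pct, warn = check_storage(".")
    print(f"{pct:.1f}% used, warning={warn}")
```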
Loss of the slow storage drive in one node of the cluster¶
Impact: Insignificant (1)
Likelihood: Likely (4)
Risk level: medium-low
Slow drives are in a RAID 5 or RAID 6 array, so the loss of one drive is non-critical. Replace the failed drive quickly to reduce the exposure to further drive failures. A second drive failure in RAID 5 or a third drive failure in RAID 6 escalates to "Loss of the node in the cluster".
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely replacement of the failed drive
Recovery:
- Replace the failed slow storage drive as soon as possible (hot swap).
- Verify that the slow storage RAID array has been reconstructed correctly.
Slow storage space shortage¶
Impact: Moderate (3)
Likelihood: Likely (4)
Risk level: medium-high
This situation is problematic if it happens on multiple nodes of the cluster simultaneously. Use monitoring tools to identify this situation ahead of escalation.
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely extension of the slow data storage size
Recovery:
- Remove unnecessary data from the slow storage space.
- Adjust the life cycle configuration so that the data are removed from slow storage space sooner.
Loss of the system drive in one node of the cluster¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
System drives are in a RAID 1 array, so the loss of one drive is non-critical. Replace the failed drive quickly, because a second system drive failure before the array is rebuilt escalates to "Loss of the node in the cluster".
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely replacement of the failed drive
Recovery:
- Replace the failed system drive as soon as possible (hot swap).
- Verify that the RAID 1 array has been reconstructed correctly.
System storage space shortage¶
Impact: Moderate (3)
Likelihood: Rare (1)
Risk level: low
Use monitoring tools to identify this situation ahead of escalation.
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Remove unnecessary data from the system storage space.
- Contact the support or the vendor.
Loss of the network connectivity in one node of the cluster¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Redundant network connectivity
Recovery:
- Restore the network connectivity
- Verify the proper cluster operational condition
Failure of the ElasticSearch cluster¶
Impact: Major (4)
Likelihood: Possible (3)
Risk level: medium-high
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating ElasticSearch cluster health
Recovery:
- Contact support and/or the vendor and agree on the recovery strategy.
Failure of the ElasticSearch node¶
Impact: Minor (2)
Likelihood: Likely (4)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating ElasticSearch cluster health
Recovery:
- Monitor the automatic rejoining of the ElasticSearch node to the cluster.
- Contact support or the vendor if the failure persists for several hours.
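Whether a node has rejoined can be watched through the Elasticsearch cluster health endpoint (`GET /_cluster/health`), which reports `status`, `number_of_nodes`, and `unassigned_shards`. The sketch below is one possible check, not part of LogMan.io: the base URL, the expected node count, and the helper names are assumptions, and authentication and TLS setup are omitted.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url: str) -> dict:
    """Fetch the cluster health document from the Elasticsearch REST API."""
    with urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def needs_attention(health: dict, expected_nodes: int) -> list[str]:
    """Return human-readable reasons why the cluster is not fully healthy."""
    reasons = []
    status = health.get("status")
    if status != "green":
        reasons.append(f"status is {status}")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        reasons.append(f"{nodes} of {expected_nodes} nodes present")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        reasons.append(f"{unassigned} unassigned shards")
    return reasons


if __name__ == "__main__":
    # Illustrative response; in practice use fetch_health("http://<node>:9200").
    sample = {"status": "yellow", "number_of_nodes": 2, "unassigned_shards": 4}
    for reason in needs_attention(sample, expected_nodes=3):
        print(reason)
```

An empty list over several consecutive polls suggests the node has rejoined and shard recovery has finished.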
Failure of the Apache Kafka cluster¶
Impact: Major (4)
Likelihood: Rare (1)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache Kafka cluster health
Recovery:
- Contact support and/or the vendor and agree on the recovery strategy.
Failure of the Apache Kafka node¶
Impact: Minor (2)
Likelihood: Rare (1)
Risk level: low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache Kafka cluster health
Recovery:
- Monitor the automatic rejoining of the Apache Kafka node to the cluster.
- Contact support or the vendor if the failure persists for several hours.
Failure of the Apache ZooKeeper cluster¶
Impact: Major (4)
Likelihood: Rare (1)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache ZooKeeper cluster health
Recovery:
- Contact support and/or the vendor and agree on the recovery strategy.
Failure of the Apache ZooKeeper node¶
Impact: Insignificant (1)
Likelihood: Rare (1)
Risk level: low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache ZooKeeper cluster health
Recovery:
- Monitor the automatic rejoining of the Apache ZooKeeper node to the cluster.
- Contact support or the vendor if the failure persists for several hours.
Failure of the stateless data path microservice (collector, parser, dispatcher, correlator, watcher)¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Restart the failed microservice.
Failure of the stateless support microservice (all others)¶
Impact: Insignificant (1)
Likelihood: Possible (3)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Restart the failed microservice.
Significant reduction of the system performance¶
Impact: Moderate (3)
Likelihood: Possible (3)
Risk level: medium-high
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Identify and remove the root cause of the performance reduction.
- Contact the vendor or support if help is needed.
Backup and recovery strategy¶
Offline backup for the incoming logs¶
Incoming logs are duplicated to an offline backup storage that is not part of the active LogMan.io cluster (hence "offline"). The offline backup makes it possible to restore logs to LogMan.io after a critical failure.
Backup strategy for the fast data storage¶
Incoming events (logs) are copied into the archive storage as soon as they enter LogMan.io, so there is always a way to “replay” events into TeskaLabs LogMan.io if needed. In addition, data are replicated to other nodes of the cluster immediately after they arrive. For this reason, a traditional backup is possible but not recommended.
The restoration is handled by the cluster components by replicating the data from other nodes of the cluster.
Backup strategy for the slow data storage¶
The data stored on the slow data storage are ALWAYS replicated to other nodes of the cluster and also stored in the archive. For this reason, traditional backup is not recommended but possible (consider the huge size of the slow storage).
The restoration is handled by the cluster components by replicating the data from other nodes of the cluster.
Backup strategy for the system storage¶
It is recommended to periodically back up all file systems on the system storage so that they can be used to restore the installation when needed. The backup strategy is compatible with most common backup technologies on the market.
- Recovery Point Objective (RPO): full backup once per week or after major maintenance work, incremental backup once per day.
- Recovery Time Objective (RTO): 12 hours.
Note
The RPO and RTO values are recommendations that assume a highly available setup of the LogMan.io cluster, that is, three or more nodes, so that the complete downtime of a single node does not impact service availability.
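The recommended cadence (one full backup per week, one incremental backup per day) can be expressed as a small scheduling rule. This is a sketch only: choosing Sunday as the full-backup day is an assumption, and the extra full backup after major maintenance work is left as a manual override.

```python
from datetime import date, timedelta


def backup_kind(day: date, full_weekday: int = 6) -> str:
    """Return 'full' on the weekly full-backup day, 'incremental' otherwise.

    `full_weekday` follows date.weekday(): Monday is 0, Sunday is 6.
    """
    return "full" if day.weekday() == full_weekday else "incremental"


def week_plan(start: date) -> list[tuple[date, str]]:
    """Return the backup kind for seven consecutive days beginning at `start`."""
    return [(start + timedelta(days=i), backup_kind(start + timedelta(days=i))) for i in range(7)]


if __name__ == "__main__":
    for day, kind in week_plan(date(2024, 1, 1)):
        print(day.isoformat(), kind)
```

Any seven consecutive days contain exactly one full backup, so a restore never needs more than one full backup plus at most six incrementals.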
Generic backup and recovery rules¶
- Data Backup: Regularly back up data to a secure location, such as a cloud-based storage service or backup tapes, to minimize data loss in case of failures.
- Backup Scheduling: Establish a backup schedule that meets the needs of the organization, such as daily, weekly, or monthly backups.
- Backup Verification: Verify the integrity of backup data regularly to ensure that it can be used for disaster recovery.
- Restoration Testing: Test the restoration of backup data regularly to ensure that the backup and recovery process works correctly and to identify and resolve any issues before they become critical.
- Backup Retention: Establish a backup retention policy that balances the need for long-term data preservation against the cost of storing backup data.
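One common way to carry out the backup verification rule is to record a checksum when the backup is written and compare it again before the backup is relied on. The sketch below uses SHA-256 from the Python standard library; the helper names and the idea of a recorded checksum are assumptions for illustration, not a description of any particular backup product.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: Path, expected_sha256: str) -> bool:
    """Return True if the file still matches the checksum recorded at backup time."""
    return sha256_of(path) == expected_sha256.lower()


if __name__ == "__main__":
    # Self-contained demo: write a throwaway file, record its checksum, verify it.
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "backup.bin"
        archive.write_bytes(b"example backup payload")
        recorded = sha256_of(archive)
        print(verify_backup(archive, recorded))
```

A matching checksum only shows that the file is unchanged; a periodic test restore is still needed to confirm that the backup is usable.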
Monitoring and alerting¶
Monitoring is an important component of a Continuity Plan as it helps to detect potential failures early, identify the cause of failures, and support decision-making during the recovery process.
LogMan.io microservices provide an OpenMetrics API and/or ship their telemetry into InfluxDB; Grafana is used as the monitoring tool.
- Monitoring Strategy: The OpenMetrics API is used to collect telemetry from all microservices in the cluster, the operating system, and the hardware. Telemetry is collected once per minute. InfluxDB stores the telemetry data. Grafana is the web-based user interface for telemetry inspection.
- Alerting and Notification: The monitoring system is configured to generate alerts and notifications in case of potential failures, such as low disk space, high resource utilization, or increased error rates.
- Monitoring Dashboards: Monitoring dashboards are provided in Grafana that display the most important metrics for the system, such as resource utilization, error rates, and response times.
- Monitoring Configuration: The monitoring configuration is reviewed and updated regularly to ensure that it is effective and reflects changes in the system.
- Monitoring Training: Training is provided for the monitoring team and other relevant parties on the monitoring system and the monitoring dashboards in Grafana.
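To illustrate how telemetry exposed in the OpenMetrics text format can drive alerts, the following sketch parses simple sample lines and compares them with thresholds. It handles only plain `name value` and `name{labels} value` lines, not the full OpenMetrics specification, and the metric names and limits shown are hypothetical rather than actual LogMan.io metrics.

```python
import re

# name, optional {labels}, then the sample value (any trailing timestamp is ignored).
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)")


def parse_samples(text: str) -> dict[str, float]:
    """Parse sample lines into {metric-with-labels: value}; skip comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if not match:
            continue
        try:
            samples[match.group(1) + (match.group(2) or "")] = float(match.group(3))
        except ValueError:
            continue  # value is not a number; ignore the line
    return samples


def breaches(samples: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the metrics whose current value exceeds the configured limit."""
    return [key for key, limit in thresholds.items() if key in samples and samples[key] > limit]


if __name__ == "__main__":
    exposition = (
        "# TYPE disk_used_ratio gauge\n"
        'disk_used_ratio{mount="/data"} 0.93\n'
        "error_rate 0.01\n"
        "# EOF\n"
    )
    limits = {'disk_used_ratio{mount="/data"}': 0.9, "error_rate": 0.05}
    print(breaches(parse_samples(exposition), limits))
```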
High availability architecture¶
TeskaLabs LogMan.io is deployed in a highly available (HA) architecture with multiple nodes to reduce the risk of single points of failure.
High availability architecture is a design pattern that aims to ensure that a system remains operational and available, even in the event of failures or disruptions.
In a LogMan.io cluster, a high availability architecture includes the following components:
- Load Balancing: Distribution of incoming traffic among multiple instances of microservices, which improves the resilience of the system and reduces the impact of failures.
- Redundant Storage: Redundant storage of data across multiple storage nodes to prevent data loss in the event of a storage failure.
- Multiple Brokers: Use of multiple brokers in Apache Kafka to improve the resilience of the messaging system and reduce the impact of broker failures.
- Automatic Failover: Automatic failover mechanisms, such as leader election in Apache Kafka, to ensure that the system continues to function in the event of a cluster node failure.
- Monitoring and Alerting: Use of monitoring and alerting components to detect potential failures and trigger automatic failover mechanisms when necessary.
- Rolling Upgrades: Upgrading nodes one at a time, so that the system is upgraded without downtime or disruption to its normal operation.
- Data Replication: Replication of logs across multiple cluster nodes to ensure that the system continues to function even if one or more nodes fail.
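For components that rely on a majority quorum, such as an Apache ZooKeeper ensemble, the number of node failures a cluster can absorb follows directly from its size. The helper names below are illustrative; Apache Kafka and ElasticSearch availability also depend on their own replication settings, which this sketch does not model.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Number of nodes that can fail while a majority remains available."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

A three-node cluster tolerates the loss of one node, and adding a fourth node does not raise that number, which is why odd cluster sizes are the usual choice.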
Communication plan¶
A clear and well-communicated plan for responding to failures and communicating with stakeholders helps to minimize the impact of failures and ensure that everyone is on the same page.
- Stakeholder Identification: Identify all stakeholders who may need to be informed during and after a disaster, such as employees, customers, vendors, and partners.
- Participating Organizations: The LogMan.io operator, the integrating party, and the vendor (TeskaLabs).
- Communication Channels: Slack, email, phone, and SMS are the channels used during and after a disaster.
- Escalation Plan: Specify an escalation plan to ensure that the right people are informed at the right time during a disaster, and that communication is coordinated and effective.
- Update and Maintenance: Regularly update and maintain the communication plan to ensure that it reflects changes in the organization, such as new stakeholders or communication channels.