SPOG.AI replaces fragmented tools with one cyber-mesh to automate audits, manage incidents, and govern infrastructure

High CPU Utilization of Server(%) Linux OS

High CPU utilization can result in performance issues and resource constraints on the server. The playbook helps in identifying the resource-intensive processes and help reduce their impact on the system.

View Playbook

High Memory Utilization of Server(%) Linux OS

High memory utilization over a period of time can cause problems. The playbook helps in determine the reason for this increase including memory consumption by processes, incorrect system or underlying host configuration, etc.

View Playbook

Too many processes running on the host Linux OS

When hosts run excessive processes, performance and resources may degrade. This playbook aids in identifying processes, analyzing their resource consumption, and detecting any misbehaving processes, such as forkbombs, which could lead to system limits being reached.

View Playbook

Swap Space filling Up Linux OS

A Linux swap space issue occurs when swap usage becomes inefficient, indicating performance concerns. While swap handles memory demands, excessive use suggests underlying issues. The playbook aids in analyzing swap usage and fine-tuning system settings for improved performance.

View Playbook

Clock Skew on Host Linux OS

Inaccurate system clock timing can lead to synchronization issues and timestamp discrepancies across systems, causing inaccurate logging, authentication failures, and troubleshooting complexities. The playbook assists in collecting diagnostic data to identify and resolve the underlying causes, ensuring system reliability.

View Playbook

Unexpected Reboot of Server Linux OS

An unexpected server reboot can disrupt services and risk data loss or corruption. This playbook aids in pinpointing the cause of the reboot, enabling fixes to address underlying issues and prevent future disruption.

View Playbook

Increase in DB Replication Lag RDS

An increase in replication lag between master and slave databases can result in data inconsistency and synchronization delays, impacting application functionality, user experience and recovery objectives. The playbook assists in identifying the cause of this increase to mitigate its impact.

View Playbook

Low space on server(%) Linux OS

High space utilization on a server can cause performance degradation, system instability, and service disruptions. This playbook assists in identifying the sources of disk space consumption and provides strategies for swift recovery.

View Playbook

Increase in Slow Queries on DB Cloud Platform

Slow queries can result in degraded performance and responsiveness of the system. The playbook will help identify the slow queries which are currently running and it's impact on the system performance.

View Playbook

Too many Connections on DB Cloud Platform

Excessive database connections can lead to performance degradation, resource exhaustion, and service disruptions. This playbook assists in identifying the root causes, such as slow queries or high traffic from application or batch servers, enabling prompt resolution

View Playbook

Unusual Drop in Select Queues on DB RDS

When the volume of transactions significantly decreases, it may indicate underlying problems such as application failures, database misconfigurations, or network issues. The playbook helps in identify the reason like long running querues or locaked tables.

View Playbook

Increase in threads on MYSQL RDS

When the number of threads exceeds the capacity of the MySQL server to handle them efficiently, it can lead to slow query processing, timeouts, and even crashes. The plyabook helps indentify the reason for this increase including long running queries or increased traffic.

View Playbook

Degradation of CDN Performance Akamai

CDN performance degradation can cause low website loading times, decreased content delivery speeds, and poor user experience. The playbook helps in identifying bottlenecks, latency issues, and connectivity problems along the delivery path.

View Playbook

Session lost with BGP Peer Network

When a BGP peer connection is lost, it can lead to traffic redirection issues, route flapping, and difficulty in reaching certain destinations on the internet. The playbook will help identiify the reason for the loss of BGP peer session including problem with last mile.

View Playbook

Degradation of Application Performance Performance

Multiple factors can cuase an application performacne to degrade this playbook checks APM agent summary, errors, database transactions etc.

View Playbook

High CPU Utilization of IoWait(%) Cloud Platform

Increase in server high I/O wait times can lead to delays in executing tasks, sluggish application performance, and increased response times for user requests. The playbook helps identify factors including disk bottlenecks, inefficient storage configurations, or heavy disk activity.

View Playbook

Increase in HTTP Error Rate Availability

Increase in HTTP error rate (5xx requests) can cause user dis-satisfaction on site. This playbook gives us insight about any change in baseline of HTTP error rate in last 30 minutes

View Playbook

Active Alerts on Server Group Cloud Platform

Applications typically involve multiple servers, including load balancer members, connected databases, utilized cache servers, and other interacting service points. This playbook aids in validating the presence of active alerts on any dependent servers

View Playbook

Degradation of Network Performance on Server Linux OS

A decline in the efficiency and speed of network operations managed by a server. This can manifest as slower data transmission, increased latency, reduced bandwidth, or frequent connectivity issues. The playbook checks server to public ulr connectivity diagnosis, this includes MTR, DIG or cURL

View Playbook

Identfiy WIfi User from IP Address Network

Inability to determine the user assigned to an IP address flagged in a malware alert on the firewall can hindered incident response. The playbook will help in identfying the user if it's mapped to an WIFI IP address block of your office or branch Location.

View Playbook

Possible Malware Identified on Endpoint Security

Timely and effective response to potential malware threats is crucial for minimizing the impact of security incidents on the organization's network and systems. This playbook streamlines threat remediation by verifying unknown malware through VirusTotal, and leveraging Cortex XDR APIs to block malicious files and schedule system scans for users.

View Playbook

Kubernetes High Disk Pressure Kubernetes

Disk pressure within Kubernetes nodes can quickly escalate into critical incidents, posing a threat to cluster stability. This playbook help us to mitigate risks. This involves pinpointing the root cause, alleviating disk space constraints, and establishing proactive monitoring measures to forestall future occurrences, ensuring uninterrupted cluster operations."

View Playbook

High CPU Utilization on Network Devices Network

This will help in identifying the reason for high cpu and will help in troubleshooting the issue better

View Playbook

K8s Deployment Replica Check Kubernetes

This incident type in Kubernetes signals disparities between the intended and actual numbers of replica pods, This playbook help in finding a potential misalignment with deployment specifications. It prompts a thorough investigation and corrective measures to realign deployments with defined specifications, ensuring consistency and reliability in cluster operations.

View Playbook

Kubernetes Pod ImagePullBackOff Kubernetes

The 'ImagePullBackOff' error in Kubernetes arises when a pod encounters difficulties fetching its container image from the designated repository. Common causes include authentication issues, incorrect image names, or network disruptions. This playbook aids in troubleshooting to rectify the image retrieval failure and guarantee smooth pod deployment within the cluster.

View Playbook

Kubernetes - Pod status not ready Kubernetes

This incident type occurs when a Kubernetes pod is not ready to accept traffic due to a configuration problem or a change in the pod specification. It can cause spikes in the number of unavailable pods and impact the performance of the system. The playbook will help in investigation and provide resolution to restore system stability and ensure uninterrupted service delivery.

View Playbook

Kubernetes Cronjob Failure Detected Kubernetes

A Kubernetes Cronjob Failure incident occurs when a scheduled task, or cronjob, in a Kubernetes cluster fails to execute as expected. The playbook task starts by fetching logs and status information related to the failed Cronjob, helping identify the root cause of the failure.

View Playbook

Nginx Ingress Controller Failure to Start on Kubernetes Kubernetes

The incident may be caused by various factors such as resource constraints on the node, affinity rules, image pull issues, misconfigured RBAC, missing default server TLS Secret, missing or invalid annotations, and invalid values of ConfigMap keys. This playbook helps troubleshoot the failure to start the Nginx Ingress Controller on Kubernetes by automating various diagnostic steps.

View Playbook

Kubernetes DNS Resolution Failure Detected Kubernetes

DNS resolution failure is a common incident type that occurs when services are unable to resolve domain names into IP addresses, leading to communication failure. This playbook against your Kubernetes nodes, you can gather relevant information to diagnose and troubleshoot DNS resolution failures in your cluster.

View Playbook

Kubernetes Nodes Experiencing PID Pressure Kubernetes

Nodes with PID Pressure in Kubernetes is an incident type that occurs when a Kubernetes cluster node experiences PID pressure, meaning that it may not be able to start more containers. This playbook helps diagnose and address Kubernetes nodes experiencing PID pressure by providing insights into the PID limits and current process counts on the cluster.

View Playbook

Unauthorized Pod Execution Alert Kubernetes

Unauthorized Pod Execution is an incident type that occurs when an unauthorized entity attempts to create a pod in a system without proper permissions. This playbook against your Kubernetes master node(s), you can investigate unauthorized pod execution alerts and gather relevant information to understand how the unauthorized pod was deployed and any associated events.

View Playbook

Cert-Manager Deployment on Kubernetes Encounters Certificate Generation Issue Kubernetes

This incident type involves an issue with generating certificates using the cert-manager deployment on Kubernetes. The incident description outlines steps for troubleshooting the issue, including checking the certificate resource, ingress annotations, and issuer state. This playbook against your Kubernetes master node(s), you can gather relevant information to diagnose certificate generation issues with Cert-Manager.

View Playbook

Kubernetes Pods Stuck in Pending State Kubernetes

Kubernetes Pods Pending incident indicates that one or more pods in a Kubernetes cluster are not running as expected and are in a pending state. This can happen due to various reasons such as resource constraints, scheduling issues, or network problems. This playbook helps in investigating and resolving the issue of Kubernetes pods stuck in the "Pending" state by performing several diagnostic tasks.

View Playbook

Unauthorized Access to Kubernetes API Server Detected Kubernetes

This incident type is characterized by the detection of unauthorized access to the Kubernetes API server. This unauthorized access potentially enables attackers to manipulate cluster resources. This playbook is designed to help investigate and respond to unauthorized access incidents detected on the Kubernetes API server.

View Playbook

Kubernetes Pods Stuck in CrashloopBackOff State Kubernetes

CrashloopBackOff on a Kubernetes Pod is an incident that occurs when a container running in a Kubernetes Pod repeatedly crashes immediately after starting up. This playbook helps streamline the management and maintenance of your Kubernetes environment by automating the detection of problematic pods.

View Playbook

Node Unavailability in Kubernetes Cluster Kubernetes

Node Not Ready in Kubernetes Cluster is an incident type that occurs when a node in a Kubernetes cluster fails to respond, is unresponsive, or is not ready to take on workloads. This playbook helps in managing node unavailability issues in a Kubernetes cluster in several ways.

View Playbook

Kubernetes Deployment With Multiple Restarts Kubernetes

The incident type of "Kubernetes deployment with multiple restarts" indicates that a Kubernetes deployment has experienced multiple restarts within a certain timeframe, which is usually indicative of a problem. This playbook provides visibility into Kubernetes deployments that have experienced multiple restarts, helping you identify potential issues or problematic deployments in your cluster.

View Playbook

High Pod Count per Node in Kubernetes Cluster Kubernetes

This incident type is related to high pod count per node in a Kubernetes cluster. This can happen due to various reasons such as misconfigurations, resource constraints, or issues with the application itself. This playbook helps in addressing the issue of high pod count per node in a Kubernetes.

View Playbook

Kubernetes Node Performance Amid Memory Pressure Kubernetes

The Kubernetes Nodes with Memory pressure incident type occurs when a Kubernetes cluster node is running low on memory, which can be caused by a memory leak in an application. This playbook provides a basic framework for monitoring memory pressure across all nodes in a Kubernetes cluster.

View Playbook

Kubernetes Cluster Helath Status Kubernetes

This incident type is related to Kubernetes cluster components health status and genrate alert if any error event observed. This playbook contributes to the reliability and stability of your Kubernetes cluster by enabling proactive monitoring and automated responses to health issues.

View Playbook

Uptime for Network devices Network

This will help in identifying the reason for high cpu and memory and will help in troubleshooting the issue better

View Playbook

Inode Exhaustion Incident. Linux OS

Such occurrences can result in system instability and impair users' access to and utilization of files and data. This playbook is designed to identify the directories or files with the highest inode usage.

View Playbook

Input/Output Errors and Data Corruption on Linux Systems Linux OS

Such errors may result in boot difficulties for users, necessitating troubleshooting to pinpoint and unmount the impacted device. This playbook is designed to identify any system errors or problematic system mount points.

View Playbook

Identifying Critical Kernel Issues. Linux OS

This could result in system instability, data loss, and various other issues. This playbook is designed to identify any kernel taints or hardware errors that may be contributing to the problem.

View Playbook

SSH Connection Disruptions Linux OS

In such occurrences, users encounter difficulty accessing essential remote resources, potentially causing disruptions in business operations and productivity. This playbook aims to discern whether any firewall rules are impeding access.

View Playbook

Host Connection Tracking Limit Incident Linux OS

Exceeding the host's connection tracking limit can lead to network connectivity disruptions and adversely impact host performance. The playbook will help in identfying the current number of established connectiones with a port/IP address."

View Playbook

DDoS Assault Targeting Apache HTTP Server Apache

This situation can render the server inaccessible to users, disrupting regular operations. This playbook is designed to identify any potential flooding of SYN packets on the server.

View Playbook

Unauthorized Directory Access Detected in Apache HTTPD Server Apache

Such incidents pose a significant risk of exposing or even pilfering sensitive data, jeopardizing the security of both the server and its associated applications. This playbook is intended to identify virtual host configurations, permissions, and any associated errors.

View Playbook

Apache HTTP Server Restart Apache

It may result in a transient outage or interruption of a web service. This playbook is designed to pinpoint causes such as software updates, configuration modifications, or hardware malfunctions.

View Playbook

Increased Apache Workers Load Apache

This could lead to sluggish or non-responsive websites. This playbook aims to determine when the number of workers reaches its maximum limit or when a substantial volume of requests is being processed concurrently.

View Playbook

Apache Cross-Origin Resource Sharing (CORS) Issues Apache

It prohibits web pages from making requests to domains other than the one serving the page. This playbook is aimed at identifying any directives or errors related to Cross-Origin Resource Sharing (CORS).

View Playbook

Heightened Memory Consumption by Apache Apache

This can lead to system unresponsiveness or sluggishness, causing performance degradation. This playbook is designed to identify memory usage by Apache and its associated modules.

View Playbook

Apache File Upload Size Limit Exceeded Apache

This occurrence may lead to the interruption of the file upload process, causing delays in system operation. This playbook is designed to identify instances where the file size exceeds the server's configured limit.

View Playbook

Apache Child Process Overflow Apache

When the Apache server hits a critical threshold, it triggers performance issues and potentially leads to server crashes. This playbook aims to pinpoint factors such as high traffic volume or misconfigured server settings.

View Playbook

Apache Server Outage Apache

This could lead to service disruptions and affect the availability of applications and services hosted on the server. This playbook is designed to identify Apache service status, configuration file errors, and firewall rule issues.

View Playbook

Failures of Apache Server's Mod_JK Workers Apache

Failure of this module can result in difficulties serving web pages and accessing web applications. This playbook is aimed at identifying what is obstructing traffic to or from Apache or Tomcat.

View Playbook

Elevated CPU Usage and Persistent Connections in Apache Server Apache

This type of incident can affect the performance of the Apache server. This playbook is designed to identify the maximum number of simultaneous connections to the server.

View Playbook

Detection of URL Redirection Loops in Apache Apache

This may result in performance degradation and affect the availability of the web server. This playbook aims to detect such loops to prevent issues and ensure the proper functioning of the web server.

View Playbook

Insufficient SSL/TLS Configuration for Apache HTTP Server Apache

If improperly configured, it becomes susceptible to exploitation by hackers for intercepting or altering sensitive data. This playbook is designed to identify SSL certificates, their associated files, versions, and any potential errors.

View Playbook

Nginx File Upload Size Limit Breached Nginx

An incident where Nginx reports a file upload size limit exceeded typically occurs when a user attempts to upload a file to a web application through an Nginx server, but the size of the uploaded file exceeds the configured limit in Nginx. This playbook, you effectively handle the incident by increasing the file upload size limit in Nginx, mitigating the impact on users who may have encountered errors due to the previous limit being exceeded.

View Playbook

ETag Header Server Information Leak Protection Nginx

This vulnerability could compromise the confidentiality and integrity of the server's data and infrastructure. This plabook helps to disabled the ETag header in Nginx configuration to prevent the disclosure of file metadata.

View Playbook

SSL Certificate Expiry nginx incedent Nginx

If the SSL certificate expires and is not promptly renewed, it can lead to a disruption of service for users trying to access the website or application. This playbook streamlines the SSL certificate update process in Nginx servers, making it faster, more reliable, and easier to manage.

View Playbook

Elevated 5xx Errors Detected on Nginx Server Nginx

The issue pertains to an abnormal increase in 5xx errors on an nginx web server, indicating server-side problems impacting user experience. These errors commonly result from issues such as resource limitations, configuration errors, upstream server problems, or network connectivity issues. This playbook streamlines the troubleshooting and resolution process, reducing manual effort and ensuring a more consistent and efficient response to the issue of higher 5xx errors on the nginx server.

View Playbook

Defending Against Clickjacking Threats Nginx

Clickjacking is a deceptive tactic used by attackers to trick users into clicking on malicious elements disguised as legitimate ones on a website. This can lead to unauthorized actions, such as transferring funds, changing account settings, or revealing sensitive information, posing significant security risks to both users and the affected website. This playbook will help protect your Nginx servers from clickjacking attacks by configuring security headers to prevent unauthorized framing and ensure a more secure browsing experience for users.

View Playbook

High 4xx Error Rates On Nginx Nginx

This incident involves a significant increase in 4xx errors on NGINX upstreams, warranting investigation to determine the underlying cause. This playbook assists in quickly identifying and mitigating the issue of high 4xx errors on NGINX upstreams, allowing for a more efficient incident response process and potentially reducing service disruption for users.

View Playbook

Cross-Origin Resource Sharing (CORS) Issues in Nginx Nginx

The incident type pertains to the challenges arising from making cross-origin resource sharing (CORS) requests to an Nginx web server. CORS serves as a security measure in web browsers, aiming to prevent unauthorized access to resources across domains. However, when Nginx is not properly configured to allow CORS requests, browsers will block these requests, resulting in errors. This playbook helps resolve CORS errors in Nginx by configuring the server to allow cross-origin requests.

View Playbook

Content-Type Discrepancy in Nginx Server Responses Nginx

An incident involving a mismatch of Content-Type in Nginx typically occurs when the server sends a response with a Content-Type header that does not match the actual content being delivered. This can lead to unexpected behavior in web applications or browsers. This playbook ensures that Nginx is configured with correct MIME types, potentially resolving Content-Type mismatch issues.

View Playbook

Nginx Upstream Peer Failure Detected Nginx

The incident involving NGINX upstream peers failing suggests a disruption in the proper functioning of the NGINX server's upstream connections. These failures are significant enough to trigger alerts, which provide detailed metrics on the percentage of failures and how they deviate from anticipated values over a specific period, aiding in the assessment of severity. This playbook aims to address issues with NGINX upstream peers failing by performing several actions like restart nginx and print debugging information.

View Playbook

Countering DDoS Threats with NGINX Nginx

This incident involves the implementation of NGINX, a widely used web server software, to defend against Distributed Denial of Service (DDoS) attacks. DDoS attacks involve overwhelming a server with a flood of traffic from multiple sources, resulting in a denial of service for legitimate users. This playbook installs NGINX, configures it with basic settings for DDoS mitigation (like request rate limiting and connection limiting), and ensures that the NGINX service is enabled and started.

View Playbook

NGINX Latency Surge Detected Nginx

This issue may lead to website downtime, sluggish page loads, and degraded user experience. High Nginx latency typically refers to a situation where the response time of Nginx servers is significantly slower than expected. This playbook is designed to address high Nginx latency by performing several key actions like check network traffic, Implement Caching etc.

View Playbook

Tomcat SSL Handshake Failure Tomcat

This can result in downtime or service disruptions for users attempting to access the affected service. This playbook will help identify issues such as incorrect SSL certificate configurations, cipher suite mismatches, or network connectivity problems.

View Playbook

Tomcat Frequent Full Garbage Collection Tomcat

This can lead to performance issues and application downtime. This playbook is designed to diagnose Java heap usage and analyze GC statistics for the Tomcat process.

View Playbook

Tomcat Server Failure Tomcat

This can cause a complete or partial system outage, disrupting the delivery of web applications and services. This playbook is designed to identify factors such as misconfigurations, network or hardware issues, software bugs, or excessive server load.

View Playbook

Tomcat HTTP Request Headers or Payload Size Exceeds Configured Limits Tomcat

In such instances, the server might decline the request or issue an error message, disrupting the regular operation of the web application. This playbook is tailored to pinpoint factors such as large file uploads, excessive utilization of cookies or query parameters, or deliberate attacks sending oversized requests.

View Playbook

Tomcat High JDBC Connection Pool Utilization Incident Tomcat

This situation may lead to a shortage of available connections for incoming requests, resulting in slow response times or potential server crashes. This playbook is designed to evaluate the utilization of the JDBC connection pool and the current status of database connections in use.

View Playbook

Tomcat Thread Pool Depletion Tomcat

In the event of its occurrence, incoming requests will remain unprocessed, resulting in service degradation or a complete outage. This playbook aims to identify thread usage and the status of the JVM.

View Playbook

Tomcat High Volume of Active Sessions. Tomcat

This could result in slow response times, unresponsive applications, or even server crashes. This playbook is crafted to identify the current CPU and memory usage of the Tomcat server, along with monitoring current network connections to the Tomcat server.

View Playbook

Tomcat High Swap Utilization Tomcat

This could slow down the server or lead to a crash, causing downtime for the hosted website or application. This playbook is designed to pinpoint which Tomcat processes are consuming the most swap space and memory.

View Playbook

Tomcat Elevated Memory Usage Tomcat

This can result in system instability and performance degradation. This playbook aims to identify the root causes such as memory leaks, misconfigurations, or issues within the application code.

View Playbook

Tomcat JVM OutOfMemory Tomcat

This leads to server crashes or unresponsiveness. This playbook is designed to identify causes such as incorrect configuration settings, memory leaks, or inadequate memory allocation.

View Playbook

Tomcat Experiencing High Volume of Suspended Threads Tomcat

In such instances, the system may become unresponsive or slow, leading to performance degradation or even a total system failure. This playbook is designed to identify slow database queries, network latency issues, or resource contention.

View Playbook

Frequent Halts in Tomcat Threads Tomcat

It can cause delays in server response time and potentially lead to application downtime. This playbook is intended to diagnose issues such as incorrect configuration, excessive server load, or memory leaks in the application code.

View Playbook

Outdated Plugins Detected in Jenkins Jenkins

Such incidents may pose potential security vulnerabilities, performance concerns, and compatibility issues with other plugins or applications that depend on them. This playbook is crafted to diagnose the versions of installed plugins and identify available updates for them.

View Playbook

Jenkins Master Server Outage Jenkins

A Jenkins master server failure can disrupt the automation pipeline, resulting in delays in software development and deployment. This playbook is designed to identify the status of replicas, network connections, and other relevant factors.

View Playbook

Elevated Percentage of Blocked Items in Jenkins Queue Jenkins

Such occurrences can prolong software development and deployment processes, leading to extended wait times for developers and ultimately delaying time-to-market for software products. This playbook aims to diagnose issues such as slow build or test execution times, network or server issues, or any other performance-related problems.

View Playbook

Large Number of Blocked Jobs in Jenkins Queue in K8S Jenkins

This may result in delays in job execution and could signify a broader issue within the system. This playbook is designed to diagnose the status of the Jenkins pod, queue status, and executor status.

View Playbook

Jenkins Run Failure Jenkins

This could result in substantial downtime and disrupt the software development and delivery process. Utilizing this playbook, we can identify triggering factors such as misconfiguration, code errors, or infrastructure issues.

View Playbook

Jenkins Build Health Score Jenkins

This incident pertains to concerns regarding the health score of Jenkins builds, which might signal failures or suboptimal performance. This playbook is designed to identify issues related to Jenkins reachability, plugin status, and configurations.

View Playbook

High Backend Session Usage Detected HAProxy

High backend session usage in HAProxy can lead to service disruptions, degraded performance, negative user experiences, increased operational overhead, missed opportunities, and regulatory compliance risks. This playbook helps mitigate the issue of high backend session usage in HAProxy by increasing the maximum connections allowed in the HAProxy configuration.

View Playbook

Conquering Client-Side Request Errors HAProxy

The incident is triggered when the error rate deviates from the expected value and surpasses a predefined threshold. It can significantly impact the end-users' experience and necessitates prompt attention to mitigate any potential further problems. This playbook helps streamline the troubleshooting process and reduce the time required to identify and resolve issues with HAProxy.

View Playbook

Unusual Surge in 4xx Frontend HTTP Responses for Host HAProxy

Anomalous frontend 4xx HTTP responses, potentially stemming from HAProxy issues, can disrupt services, diminish user experience, and incur financial losses for businesses. This playbook streamlines the process of resolving HAProxy-related issues that may be contributing to the occurrence of anomalous 4xx HTTP responses. Anomalous frontend 4xx HTTP responses, potentially stemming from HAProxy issues, can disrupt services, diminish user experience, and incur financial losses for businesses. This playbook streamlines the process of resolving HAProxy-related issues that may be contributing to the occurrence of anomalous 4xx HTTP responses.

View Playbook

Elasticsearch Disk Space Exhaustion Elasticsearch

Such occurrences can lead to various issues including slow performance, inability to index new data, and even system crashes. This playbook is designed to diagnose disk usage, cluster health, and shard status on Elasticsearch instances.

View Playbook

Elevated Query Load on Elasticsearch Service Elasticsearch

This may lead to slower query response times and diminished system performance. This playbook is tailored to diagnose query load, query performance of Elasticsearch, and assess cluster state.

View Playbook

Nodes in Elasticsearch with Version Discrepancies Elasticsearch

This could lead to problems with data indexing, querying, and retrieval. This playbook is designed to identify the Elasticsearch version on each node, assess cluster health, and monitor its status.

View Playbook

Elasticsearch Instability and Cluster Failures Elasticsearch

These issues can significantly impact system performance, resulting in downtime and potential data loss. This playbook is crafted to diagnose cluster health and statistics, index health and statistics, as well as node information and statistics.

View Playbook

Elasticsearch Virtual Memory Limitation Elasticsearch

This incident can result in system downtime and affect the performance of applications dependent on Elasticsearch. This playbook is designed to obtain information regarding memory usage and allocation to Elasticsearch.

View Playbook

Analyzing the Container Absence | Docker Container

A "container absent" issue occurs when a container that is expected to be running within a containerized environment like Kubernetes or Docker is not present or has unexpectedly stopped. This can lead to various problems such as application unavailability, service interruptions, or unexpected behavior, depending on the criticality of the container and its services. The playbook helps in identifying the root cause, which could be anything from resource constraints to misconfigurations, network issues, or software bugs.

View Playbook

Docker Image Pull Error - Retrieval Failure Container

A Docker Image Pull Failure refers to an incident in which the Docker engine is unable to retrieve a particular image from a container registry. This plabook streamline the troubleshooting process, reduce manual effort, and potentially expedite the resolution of Docker Image Pull Failures.

View Playbook

Tackling Docker Network Routing Hurdles Container

When Docker containers encounter a network routing issue, it signifies a breakdown in communication among containers due to network configuration problems. This could arise from improper routing of network traffic between containers or conflicts involving IP addresses or port numbers. This plabook helps in understanding the current state of the Docker network setup and identifying potential issues affecting routing.

View Playbook

Docker Volume Mounting Incident Container

This can happen due to various reasons such as incorrect volume paths, permissions issues, or conflicts with existing volumes. As a result of this incident, Docker containers may fail to access the required data or configurations stored in volumes, leading to application errors or malfunctions. The playbook assists in identifying the cause of this and potentially speeds up the resolution time.

View Playbook

Conflict In Container Names Incident Container

A conflict in container names incident occurs when there are multiple Docker containers running on the same host with identical names. As a result of this incident, Docker may fail to start or manage containers properly, leading to service disruptions or unexpected behavior in containerized applications. This playbook helps to resolve conflicts in container names by taking the few actions like stop and remove conflicting containers, rename or recreate containers.

View Playbook

Insufficient Replication in Cassandra Cluster Cassandra

This could lead to data loss in the event of one or more node failures. This playbook is designed to identify the number of nodes in the cluster, the replication factor, and the replication strategy.

View Playbook

Cassandra Version Inconsistencies in Cluster Cassandra

This could result in various issues including data inconsistencies, node failures, and performance degradation. The playbook is intended to identify the Cassandra version on all nodes, assess the status of all nodes in the cluster, determine the schema version of the cluster, and evaluate the status of the node's anti-entropy service.

View Playbook

Cassandra Connection Timeout Cassandra

This could potentially affect the availability and performance of the Cassandra service. This playbook is designed to investigate the status of the Cassandra cluster and address any connection issues that may arise.

View Playbook

Slow Query Performance on Cassandra Cassandra

Slow queries can degrade performance and impact system efficiency. This playbook is aimed at investigating the root cause of slow queries, optimizing database configurations, and refining query performance.

View Playbook

Slow Query Execution on Cassandra Cluster Cassandra

This delay can render the system unresponsive and degrade performance. The playbook is designed to identify various factors contributing to this, such as increased traffic, inefficient queries, or hardware issues.

View Playbook

Elevated Average Queue Size on Cassandra Cassandra

When the average queue size is excessively high, it may indicate that the database is struggling to handle incoming requests, potentially leading to slower response times and service disruptions. This playbook is tailored to diagnose queue size, the number of connections, and the number of requests per second to address performance issues.

View Playbook

Disk Latency Issues in Cassandra Cluster Cassandra

This may result in performance issues and could potentially lead to data loss or downtime. The playbook is designed to diagnose disk usage, I/O operations, and read/write performance to address any underlying issues.

View Playbook

Cassandra Coordinator Query Latency Timeout Cassandra

If queries cannot be processed quickly enough, clients may experience timeouts, preventing them from retrieving the necessary data. This playbook is designed to diagnose multiple factors contributing to this issue, such as high load on the cluster, network issues, or hardware problems.

View Playbook

High Memmory Utilization on Network Devices Network

This will help in identifying the reason for high memory and will help in troubleshooting the issue better.

View Playbook

Nginx Failover Check Nginx

These steps facilitate quick identification of Nginx failover causes, streamlining troubleshooting for swift issue resolution and enhanced system reliability.

View Playbook

Runbook Library

High CPU Utilization of Server(%) Linux OS

High Memory Utilization of Server(%) Linux OS

Too many processes running on the host Linux OS

Swap Space filling Up Linux OS

Clock Skew on Host Linux OS

Unexpected Reboot of Server Linux OS

Increase in DB Replication Lag RDS

Low space on server(%) Linux OS

Increase in Slow Queries on DB Cloud Platform

Too many Connections on DB Cloud Platform

Unusual Drop in Select Queues on DB RDS

Increase in threads on MYSQL RDS

Degradation of CDN Performance Akamai

Session lost with BGP Peer Network

Degradation of Application Performance Performance

High CPU Utilization of IoWait(%) Cloud Platform

Increase in HTTP Error Rate Availability

Active Alerts on Server Group Cloud Platform

Degradation of Network Performance on Server Linux OS

Identfiy WIfi User from IP Address Network

Possible Malware Identified on Endpoint Security

Kubernetes High Disk Pressure Kubernetes

High CPU Utilization on Network Devices Network

K8s Deployment Replica Check Kubernetes

Kubernetes Pod ImagePullBackOff Kubernetes

Kubernetes - Pod status not ready Kubernetes

Kubernetes Cronjob Failure Detected Kubernetes

Nginx Ingress Controller Failure to Start on Kubernetes Kubernetes

Kubernetes DNS Resolution Failure Detected Kubernetes

Kubernetes Nodes Experiencing PID Pressure Kubernetes

Unauthorized Pod Execution Alert Kubernetes

Cert-Manager Deployment on Kubernetes Encounters Certificate Generation Issue Kubernetes

Kubernetes Pods Stuck in Pending State Kubernetes

Unauthorized Access to Kubernetes API Server Detected Kubernetes

Kubernetes Pods Stuck in CrashloopBackOff State Kubernetes

Node Unavailability in Kubernetes Cluster Kubernetes

Kubernetes Deployment With Multiple Restarts Kubernetes

High Pod Count per Node in Kubernetes Cluster Kubernetes

Kubernetes Node Performance Amid Memory Pressure Kubernetes

Kubernetes Cluster Helath Status Kubernetes

Uptime for Network devices Network

Inode Exhaustion Incident. Linux OS

Input/Output Errors and Data Corruption on Linux Systems Linux OS

Identifying Critical Kernel Issues. Linux OS

SSH Connection Disruptions Linux OS

Host Connection Tracking Limit Incident Linux OS

DDoS Assault Targeting Apache HTTP Server Apache

Unauthorized Directory Access Detected in Apache HTTPD Server Apache

Apache HTTP Server Restart Apache

Increased Apache Workers Load Apache

Apache Cross-Origin Resource Sharing (CORS) Issues Apache

Heightened Memory Consumption by Apache Apache

Apache File Upload Size Limit Exceeded Apache

Apache Child Process Overflow Apache

Apache Server Outage Apache

Failures of Apache Server's Mod_JK Workers Apache

Elevated CPU Usage and Persistent Connections in Apache Server Apache

Detection of URL Redirection Loops in Apache Apache

Insufficient SSL/TLS Configuration for Apache HTTP Server Apache

Nginx File Upload Size Limit Breached Nginx

ETag Header Server Information Leak Protection Nginx

SSL Certificate Expiry nginx incedent Nginx

Elevated 5xx Errors Detected on Nginx Server Nginx

Defending Against Clickjacking Threats Nginx

High 4xx Error Rates On Nginx Nginx

Cross-Origin Resource Sharing (CORS) Issues in Nginx Nginx

Content-Type Discrepancy in Nginx Server Responses Nginx

Nginx Upstream Peer Failure Detected Nginx

Countering DDoS Threats with NGINX Nginx

NGINX Latency Surge Detected Nginx

Tomcat SSL Handshake Failure Tomcat

Tomcat Frequent Full Garbage Collection Tomcat

Tomcat Server Failure Tomcat

Tomcat HTTP Request Headers or Payload Size Exceeds Configured Limits Tomcat

Tomcat High JDBC Connection Pool Utilization Incident Tomcat

Tomcat Thread Pool Depletion Tomcat

Tomcat High Volume of Active Sessions. Tomcat

Tomcat High Swap Utilization Tomcat

Tomcat Elevated Memory Usage Tomcat