A comprehensive library of runbooks to streamline your troubleshooting processes, helping you resolve issues faster and more effectively.
Whether you're a seasoned IT professional or just starting out, our runbooks are designed to provide step-by-step guidance for common issues across various systems and technologies.
High CPU utilization can result in performance issues and resource constraints on the server. The playbook helps identify resource-intensive processes and reduce their impact on the system.
High memory utilization over a period of time can cause problems. The playbook helps determine the reason for the increase, including memory consumption by processes, incorrect system or underlying host configuration, and other factors.
When hosts run an excessive number of processes, performance may degrade and resources may be exhausted. This playbook aids in identifying processes, analyzing their resource consumption, and detecting any misbehaving processes, such as fork bombs, which could lead to system limits being reached.
A Linux swap space issue occurs when swap usage becomes inefficient, indicating performance concerns. While swap handles memory demands, excessive use suggests underlying issues. The playbook aids in analyzing swap usage and fine-tuning system settings for improved performance.
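For illustration, the kind of swap analysis such a playbook performs might resemble the following sketch of common Linux commands (the exact steps and thresholds in the playbook may differ):
```sh
# Show overall memory and swap usage
free -h
# List active swap devices and how much of each is in use
swapon --show
# Sample swap-in/swap-out activity (si/so columns) over 5 seconds
vmstat 1 5
# Check how aggressively the kernel swaps (0-100; lower favors RAM)
sysctl vm.swappiness
```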
Inaccurate system clock timing can lead to synchronization issues and timestamp discrepancies across systems, causing inaccurate logging, authentication failures, and troubleshooting complexities. The playbook assists in collecting diagnostic data to identify and resolve the underlying causes, ensuring system reliability.
An unexpected server reboot can disrupt services and risk data loss or corruption. This playbook aids in pinpointing the cause of the reboot, enabling fixes to address underlying issues and prevent future disruption.
An increase in replication lag between master and slave databases can result in data inconsistency and synchronization delays, impacting application functionality, user experience and recovery objectives. The playbook assists in identifying the cause of this increase to mitigate its impact.
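As an example, on a MySQL replica a check of this kind might start from the replication status, roughly as sketched below (the command is a typical starting point, not the playbook's exact step):
```sh
# Report replication status; Seconds_Behind_Master indicates the current lag
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error"
```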
High space utilization on a server can cause performance degradation, system instability, and service disruptions. This playbook assists in identifying the sources of disk space consumption and provides strategies for swift recovery.
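A minimal sketch of the usual first steps, assuming a Linux server (the starting path is an example):
```sh
# Identify the filesystem that is filling up
df -h
# Summarize space used by top-level directories on that filesystem, largest first
sudo du -xh --max-depth=1 / | sort -rh | head -20
```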
Slow queries can result in degraded performance and responsiveness of the system. The playbook helps identify the slow queries currently running and their impact on system performance.
Excessive database connections can lead to performance degradation, resource exhaustion, and service disruptions. This playbook assists in identifying the root causes, such as slow queries or high traffic from application or batch servers, enabling prompt resolution.
When the volume of transactions significantly decreases, it may indicate underlying problems such as application failures, database misconfigurations, or network issues. The playbook helps identify the reason, such as long-running queries or locked tables.
When the number of threads exceeds the capacity of the MySQL server to handle them efficiently, it can lead to slow query processing, timeouts, and even crashes. The playbook helps identify the reason for this increase, including long-running queries or increased traffic.
CDN performance degradation can cause slow website loading times, decreased content delivery speeds, and poor user experience. The playbook helps in identifying bottlenecks, latency issues, and connectivity problems along the delivery path.
When a BGP peer connection is lost, it can lead to traffic redirection issues, route flapping, and difficulty in reaching certain destinations on the internet. The playbook will help identify the reason for the loss of the BGP peer session, including problems with the last mile.
Multiple factors can cause application performance to degrade. This playbook checks the APM agent summary, errors, database transactions, and more.
An increase in server I/O wait times can lead to delays in executing tasks, sluggish application performance, and increased response times for user requests. The playbook helps identify contributing factors including disk bottlenecks, inefficient storage configurations, or heavy disk activity.
An increase in the HTTP error rate (5xx responses) can cause user dissatisfaction on the site. This playbook gives insight into any change in the HTTP error rate baseline over the last 30 minutes.
Applications typically involve multiple servers, including load balancer members, connected databases, utilized cache servers, and other interacting service points. This playbook aids in validating the presence of active alerts on any dependent servers.
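For example, assuming a MySQL backend, a slow-query check of this kind might begin with the active process list (a sketch, not the playbook's exact query):
```sh
# List currently running statements, longest-running first (requires the PROCESS privilege)
mysql -e "SELECT id, user, time, state, LEFT(info, 80) AS query
          FROM information_schema.processlist
          WHERE command <> 'Sleep'
          ORDER BY time DESC;"
```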
A decline in the efficiency and speed of network operations managed by a server can manifest as slower data transmission, increased latency, reduced bandwidth, or frequent connectivity issues. The playbook runs a server-to-public-URL connectivity diagnosis using tools such as MTR, dig, and cURL.
Inability to determine the user assigned to an IP address flagged in a malware alert on the firewall can hinder incident response. The playbook will help identify the user if the address is mapped to a Wi-Fi IP address block of your office or branch location.
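A sketch of what such a connectivity diagnosis typically looks like; the target hostname is a placeholder and the flags are common choices rather than the playbook's exact invocation:
```sh
TARGET=example.com   # hypothetical endpoint the server must reach
# Trace the network path and per-hop loss/latency (report mode, 10 cycles)
mtr -rwz -c 10 "$TARGET"
# Verify DNS resolution for the target
dig +short "$TARGET"
# Measure end-to-end HTTP timing without downloading the response body
curl -o /dev/null -s -w "dns:%{time_namelookup} connect:%{time_connect} ttfb:%{time_starttransfer} total:%{time_total}\n" "https://$TARGET"
```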
Timely and effective response to potential malware threats is crucial for minimizing the impact of security incidents on the organization's network and systems. This playbook streamlines threat remediation by verifying unknown malware through VirusTotal, and leveraging Cortex XDR APIs to block malicious files and schedule system scans for users.
Disk pressure within Kubernetes nodes can quickly escalate into critical incidents, posing a threat to cluster stability. This playbook helps mitigate these risks by pinpointing the root cause, alleviating disk space constraints, and establishing proactive monitoring measures to forestall future occurrences, ensuring uninterrupted cluster operations.
This playbook helps identify the reason for high CPU usage and makes the issue easier to troubleshoot.
This incident type in Kubernetes signals disparities between the intended and actual numbers of replica pods. This playbook helps find potential misalignment with deployment specifications. It prompts a thorough investigation and corrective measures to realign deployments with defined specifications, ensuring consistency and reliability in cluster operations.
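For illustration, the first diagnostic steps for node disk pressure often look like the sketch below (the node name is a placeholder and the paths are typical defaults, not necessarily the playbook's exact commands):
```sh
# Check which nodes report the DiskPressure condition
kubectl get nodes
kubectl describe node <node-name> | grep -A8 "Conditions"
# On the affected node, see which filesystems are full
df -h /var/lib/kubelet /var/lib/containerd 2>/dev/null || df -h
```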
The 'ImagePullBackOff' error in Kubernetes arises when a pod encounters difficulties fetching its container image from the designated repository. Common causes include authentication issues, incorrect image names, or network disruptions. This playbook aids in troubleshooting to rectify the image retrieval failure and guarantee smooth pod deployment within the cluster.
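A minimal sketch of the usual triage for ImagePullBackOff; pod and namespace names are placeholders:
```sh
# Inspect the pod's events for the exact image pull error (auth failure, bad tag, network)
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"
# Look for recent pull-related events across the namespace
kubectl get events -n <namespace> --sort-by=.lastTimestamp | grep -i "pull"
```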
This incident type occurs when a Kubernetes pod is not ready to accept traffic due to a configuration problem or a change in the pod specification. It can cause spikes in the number of unavailable pods and impact the performance of the system. The playbook will help in investigation and provide resolution to restore system stability and ensure uninterrupted service delivery.
A Kubernetes Cronjob Failure incident occurs when a scheduled task, or cronjob, in a Kubernetes cluster fails to execute as expected. The playbook task starts by fetching logs and status information related to the failed Cronjob, helping identify the root cause of the failure.
The incident may be caused by various factors such as resource constraints on the node, affinity rules, image pull issues, misconfigured RBAC, missing default server TLS Secret, missing or invalid annotations, and invalid values of ConfigMap keys. This playbook helps troubleshoot the failure to start the Nginx Ingress Controller on Kubernetes by automating various diagnostic steps.
DNS resolution failure is a common incident type that occurs when services are unable to resolve domain names into IP addresses, leading to communication failure. By running this playbook against your Kubernetes nodes, you can gather relevant information to diagnose and troubleshoot DNS resolution failures in your cluster.
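As an example of the kind of checks involved, in-cluster DNS can be probed with a throwaway pod and the CoreDNS pods inspected, roughly as follows (image and labels are common defaults, shown here as assumptions):
```sh
# Test in-cluster name resolution from a short-lived pod
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
# Verify CoreDNS is healthy and check its logs for errors
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```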
Nodes with PID Pressure in Kubernetes is an incident type that occurs when a Kubernetes cluster node experiences PID pressure, meaning that it may not be able to start more containers. This playbook helps diagnose and address Kubernetes nodes experiencing PID pressure by providing insights into the PID limits and current process counts on the cluster.
Unauthorized Pod Execution is an incident type that occurs when an unauthorized entity attempts to create a pod in a system without proper permissions. By running this playbook against your Kubernetes master node(s), you can investigate unauthorized pod execution alerts and gather relevant information to understand how the unauthorized pod was deployed and any associated events.
This incident type involves an issue with generating certificates using the cert-manager deployment on Kubernetes. The incident description outlines steps for troubleshooting the issue, including checking the certificate resource, ingress annotations, and issuer state. By running this playbook against your Kubernetes master node(s), you can gather relevant information to diagnose certificate generation issues with cert-manager.
Kubernetes Pods Pending incident indicates that one or more pods in a Kubernetes cluster are not running as expected and are in a pending state. This can happen due to various reasons such as resource constraints, scheduling issues, or network problems. This playbook helps in investigating and resolving the issue of Kubernetes pods stuck in the "Pending" state by performing several diagnostic tasks.
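A sketch of typical diagnostic commands for pending pods; pod and namespace names are placeholders:
```sh
# List pods stuck in Pending across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# Read the scheduler's events for a specific pod (insufficient resources, taints, etc.)
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"
# Check how much allocatable CPU/memory remains on each node
kubectl describe nodes | grep -A5 "Allocated resources"
```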
This incident type is characterized by the detection of unauthorized access to the Kubernetes API server. This unauthorized access potentially enables attackers to manipulate cluster resources. This playbook is designed to help investigate and respond to unauthorized access incidents detected on the Kubernetes API server.
CrashloopBackOff on a Kubernetes Pod is an incident that occurs when a container running in a Kubernetes Pod repeatedly crashes immediately after starting up. This playbook helps streamline the management and maintenance of your Kubernetes environment by automating the detection of problematic pods.
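For illustration, detection and first-pass triage of CrashLoopBackOff pods commonly look like this (names are placeholders):
```sh
# Find pods currently in CrashLoopBackOff
kubectl get pods --all-namespaces | grep CrashLoopBackOff
# Read the logs of the previous (crashed) container instance and the pod's events
kubectl logs <pod-name> -n <namespace> --previous
kubectl describe pod <pod-name> -n <namespace>
```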
Node Not Ready in Kubernetes Cluster is an incident type that occurs when a node in a Kubernetes cluster fails to respond, is unresponsive, or is not ready to take on workloads. This playbook helps in managing node unavailability issues in a Kubernetes cluster in several ways.
The incident type of "Kubernetes deployment with multiple restarts" indicates that a Kubernetes deployment has experienced multiple restarts within a certain timeframe, which is usually indicative of a problem. This playbook provides visibility into Kubernetes deployments that have experienced multiple restarts, helping you identify potential issues or problematic deployments in your cluster.
This incident type is related to high pod count per node in a Kubernetes cluster. This can happen due to various reasons such as misconfigurations, resource constraints, or issues with the application itself. This playbook helps in addressing the issue of high pod count per node in a Kubernetes cluster.
The Kubernetes Nodes with Memory pressure incident type occurs when a Kubernetes cluster node is running low on memory, which can be caused by a memory leak in an application. This playbook provides a basic framework for monitoring memory pressure across all nodes in a Kubernetes cluster.
This incident type relates to the health status of Kubernetes cluster components and generates an alert if any error event is observed. This playbook contributes to the reliability and stability of your Kubernetes cluster by enabling proactive monitoring and automated responses to health issues.
This playbook helps identify the reason for high CPU and memory usage and makes the issue easier to troubleshoot.
Such occurrences can result in system instability and impair users' access to and utilization of files and data. This playbook is designed to identify the directories or files with the highest inode usage.
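A minimal sketch of how inode hot spots are usually located (the scanned path is an example; adjust it to the filesystem flagged by df):
```sh
# Show inode usage per filesystem
df -i
# Count entries per top-level directory to find where inodes are being consumed
for d in /var/*; do echo "$(find "$d" -xdev 2>/dev/null | wc -l) $d"; done | sort -rn | head
```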
Such errors may result in boot difficulties for users, necessitating troubleshooting to pinpoint and unmount the impacted device. This playbook is designed to identify any system errors or problematic system mount points.
This could result in system instability, data loss, and various other issues. This playbook is designed to identify any kernel taints or hardware errors that may be contributing to the problem.
In such occurrences, users encounter difficulty accessing essential remote resources, potentially causing disruptions in business operations and productivity. This playbook aims to discern whether any firewall rules are impeding access.
Exceeding the host's connection tracking limit can lead to network connectivity disruptions and adversely impact host performance. The playbook helps identify the current number of established connections for a given port or IP address.
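For example, the conntrack headroom and per-port connection counts can be checked roughly as follows (port 443 is an illustrative value):
```sh
# Compare the current number of tracked connections against the kernel limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# Count established TCP connections involving a given port
ss -Htan state established '( sport = :443 or dport = :443 )' | wc -l
```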
This situation can render the server inaccessible to users, disrupting regular operations. This playbook is designed to identify any potential flooding of SYN packets on the server.
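A sketch of how a SYN flood is typically confirmed from the host itself (commands are common examples, not the playbook's exact steps):
```sh
# Count half-open connections; a large, growing number may indicate a SYN flood
ss -Htan state syn-recv | wc -l
# Show which peer addresses hold the most half-open connections
ss -Htan state syn-recv | awk '{print $NF}' | sort | uniq -c | sort -rn | head
```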
Such incidents pose a significant risk of exposing or even pilfering sensitive data, jeopardizing the security of both the server and its associated applications. This playbook is intended to identify virtual host configurations, permissions, and any associated errors.
It may result in a transient outage or interruption of a web service. This playbook is designed to pinpoint causes such as software updates, configuration modifications, or hardware malfunctions.
This could lead to sluggish or non-responsive websites. This playbook aims to determine when the number of workers reaches its maximum limit or when a substantial volume of requests is being processed concurrently.
It prohibits web pages from making requests to domains other than the one serving the page. This playbook is aimed at identifying any directives or errors related to Cross-Origin Resource Sharing (CORS).
This can lead to system unresponsiveness or sluggishness, causing performance degradation. This playbook is designed to identify memory usage by Apache and its associated modules.
This occurrence may lead to the interruption of the file upload process, causing delays in system operation. This playbook is designed to identify instances where the file size exceeds the server's configured limit.
When the Apache server hits a critical threshold, it triggers performance issues and potentially leads to server crashes. This playbook aims to pinpoint factors such as high traffic volume or misconfigured server settings.
This could lead to service disruptions and affect the availability of applications and services hosted on the server. This playbook is designed to identify Apache service status, configuration file errors, and firewall rule issues.
Failure of this module can result in difficulties serving web pages and accessing web applications. This playbook is aimed at identifying what is obstructing traffic to or from Apache or Tomcat.
This type of incident can affect the performance of the Apache server. This playbook is designed to identify the maximum number of simultaneous connections to the server.
This may result in performance degradation and affect the availability of the web server. This playbook aims to detect such loops to prevent issues and ensure the proper functioning of the web server.
If improperly configured, it becomes susceptible to exploitation by hackers for intercepting or altering sensitive data. This playbook is designed to identify SSL certificates, their associated files, versions, and any potential errors.
An incident where Nginx reports a file upload size limit exceeded typically occurs when a user attempts to upload a file to a web application through an Nginx server, but the size of the uploaded file exceeds the configured limit in Nginx. With this playbook, you can handle the incident by increasing the file upload size limit in Nginx, mitigating the impact on users who may have encountered errors due to the previous limit being exceeded.
This vulnerability could compromise the confidentiality and integrity of the server's data and infrastructure. This playbook helps disable the ETag header in the Nginx configuration to prevent the disclosure of file metadata.
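A minimal sketch of the change involved, assuming the stock package layout where conf.d files are included at the http level; the drop-in file name and the 50 MB value are illustrative assumptions:
```sh
# Raise the request body limit in an http-level drop-in, then validate and reload
sudo tee /etc/nginx/conf.d/upload-size.conf >/dev/null <<'EOF'
client_max_body_size 50m;
EOF
sudo nginx -t && sudo systemctl reload nginx
```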
If the SSL certificate expires and is not promptly renewed, it can lead to a disruption of service for users trying to access the website or application. This playbook streamlines the SSL certificate update process in Nginx servers, making it faster, more reliable, and easier to manage.
The issue pertains to an abnormal increase in 5xx errors on an nginx web server, indicating server-side problems impacting user experience. These errors commonly result from issues such as resource limitations, configuration errors, upstream server problems, or network connectivity issues. This playbook streamlines the troubleshooting and resolution process, reducing manual effort and ensuring a more consistent and efficient response to the issue of higher 5xx errors on the nginx server.
Clickjacking is a deceptive tactic used by attackers to trick users into clicking on malicious elements disguised as legitimate ones on a website. This can lead to unauthorized actions, such as transferring funds, changing account settings, or revealing sensitive information, posing significant security risks to both users and the affected website. This playbook will help protect your Nginx servers from clickjacking attacks by configuring security headers to prevent unauthorized framing and ensure a more secure browsing experience for users.
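For illustration, anti-framing headers are typically added along these lines (the drop-in path is an assumption; note that add_header directives can be overridden by more specific server/location blocks):
```sh
# Add anti-clickjacking response headers, then validate and reload Nginx
sudo tee /etc/nginx/conf.d/clickjacking.conf >/dev/null <<'EOF'
add_header X-Frame-Options "SAMEORIGIN" always;
add_header Content-Security-Policy "frame-ancestors 'self';" always;
EOF
sudo nginx -t && sudo systemctl reload nginx
```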
This incident involves a significant increase in 4xx errors on NGINX upstreams, warranting investigation to determine the underlying cause. This playbook assists in quickly identifying and mitigating the issue of high 4xx errors on NGINX upstreams, allowing for a more efficient incident response process and potentially reducing service disruption for users.
The incident type pertains to the challenges arising from making cross-origin resource sharing (CORS) requests to an Nginx web server. CORS serves as a security measure in web browsers, aiming to prevent unauthorized access to resources across domains. However, when Nginx is not properly configured to allow CORS requests, browsers will block these requests, resulting in errors. This playbook helps resolve CORS errors in Nginx by configuring the server to allow cross-origin requests.
An incident involving a mismatch of Content-Type in Nginx typically occurs when the server sends a response with a Content-Type header that does not match the actual content being delivered. This can lead to unexpected behavior in web applications or browsers. This playbook ensures that Nginx is configured with correct MIME types, potentially resolving Content-Type mismatch issues.
The incident involving NGINX upstream peers failing suggests a disruption in the proper functioning of the NGINX server's upstream connections. These failures are significant enough to trigger alerts, which provide detailed metrics on the percentage of failures and how they deviate from anticipated values over a specific period, aiding in the assessment of severity. This playbook aims to address issues with NGINX upstream peers failing by performing several actions, such as restarting Nginx and printing debugging information.
This incident involves the implementation of NGINX, a widely used web server software, to defend against Distributed Denial of Service (DDoS) attacks. DDoS attacks involve overwhelming a server with a flood of traffic from multiple sources, resulting in a denial of service for legitimate users. This playbook installs NGINX, configures it with basic settings for DDoS mitigation (like request rate limiting and connection limiting), and ensures that the NGINX service is enabled and started.
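A sketch of what such basic rate- and connection-limiting configuration looks like; the zone names, rates, and file path are illustrative values rather than the playbook's exact settings:
```sh
# Define per-client request-rate and connection limits at the http level
sudo tee /etc/nginx/conf.d/ddos-mitigation.conf >/dev/null <<'EOF'
limit_req_zone $binary_remote_addr zone=req_per_ip:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=conn_per_ip:10m;
EOF
# The zones are then referenced inside server/location blocks, e.g.:
#   limit_req zone=req_per_ip burst=20 nodelay;
#   limit_conn conn_per_ip 20;
sudo nginx -t && sudo systemctl reload nginx
```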
This issue may lead to website downtime, sluggish page loads, and degraded user experience. High Nginx latency typically refers to a situation where the response time of Nginx servers is significantly slower than expected. This playbook is designed to address high Nginx latency by performing several key actions, such as checking network traffic and implementing caching.
This can result in downtime or service disruptions for users attempting to access the affected service. This playbook will help identify issues such as incorrect SSL certificate configurations, cipher suite mismatches, or network connectivity problems.
This can lead to performance issues and application downtime. This playbook is designed to diagnose Java heap usage and analyze GC statistics for the Tomcat process.
This can cause a complete or partial system outage, disrupting the delivery of web applications and services. This playbook is designed to identify factors such as misconfigurations, network or hardware issues, software bugs, or excessive server load.
In such instances, the server might decline the request or issue an error message, disrupting the regular operation of the web application. This playbook is tailored to pinpoint factors such as large file uploads, excessive utilization of cookies or query parameters, or deliberate attacks sending oversized requests.
This situation may lead to a shortage of available connections for incoming requests, resulting in slow response times or potential server crashes. This playbook is designed to evaluate the utilization of the JDBC connection pool and the current status of database connections in use.
In the event of its occurrence, incoming requests will remain unprocessed, resulting in service degradation or a complete outage. This playbook aims to identify thread usage and the status of the JVM.
This could result in slow response times, unresponsive applications, or even server crashes. This playbook is crafted to identify the current CPU and memory usage of the Tomcat server, along with monitoring current network connections to the Tomcat server.
This could slow down the server or lead to a crash, causing downtime for the hosted website or application. This playbook is designed to pinpoint which Tomcat processes are consuming the most swap space and memory.
This can result in system instability and performance degradation. This playbook aims to identify the root causes such as memory leaks, misconfigurations, or issues within the application code.
This leads to server crashes or unresponsiveness. This playbook is designed to identify causes such as incorrect configuration settings, memory leaks, or inadequate memory allocation.
In such instances, the system may become unresponsive or slow, leading to performance degradation or even a total system failure. This playbook is designed to identify slow database queries, network latency issues, or resource contention.
It can cause delays in server response time and potentially lead to application downtime. This playbook is intended to diagnose issues such as incorrect configuration, excessive server load, or memory leaks in the application code.
Such incidents may pose potential security vulnerabilities, performance concerns, and compatibility issues with other plugins or applications that depend on them. This playbook is crafted to diagnose the versions of installed plugins and identify available updates for them.
A Jenkins master server failure can disrupt the automation pipeline, resulting in delays in software development and deployment. This playbook is designed to identify the status of replicas, network connections, and other relevant factors.
Such occurrences can prolong software development and deployment processes, leading to extended wait times for developers and ultimately delaying time-to-market for software products. This playbook aims to diagnose issues such as slow build or test execution times, network or server issues, or any other performance-related problems.
This may result in delays in job execution and could signify a broader issue within the system. This playbook is designed to diagnose the status of the Jenkins pod, queue status, and executor status.
This could result in substantial downtime and disrupt the software development and delivery process. Utilizing this playbook, we can identify triggering factors such as misconfiguration, code errors, or infrastructure issues.
This incident pertains to concerns regarding the health score of Jenkins builds, which might signal failures or suboptimal performance. This playbook is designed to identify issues related to Jenkins reachability, plugin status, and configurations.
High backend session usage in HAProxy can lead to service disruptions, degraded performance, negative user experiences, increased operational overhead, missed opportunities, and regulatory compliance risks. This playbook helps mitigate the issue of high backend session usage in HAProxy by increasing the maximum connections allowed in the HAProxy configuration.
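As an illustration, current usage can be compared against the configured ceiling and the limit raised roughly as follows (the stats socket path and the maxconn value are common defaults shown as assumptions):
```sh
# Inspect current vs. maximum connections via the HAProxy stats socket
echo "show info" | sudo socat stdio /run/haproxy/admin.sock | grep -Ei "maxconn|currconn"
# Raising the limit is done in haproxy.cfg (global and/or defaults section), e.g.:
#   global
#       maxconn 20000
# then validate the configuration and reload the service:
sudo haproxy -c -f /etc/haproxy/haproxy.cfg && sudo systemctl reload haproxy
```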
The incident is triggered when the error rate deviates from the expected value and surpasses a predefined threshold. It can significantly impact the end-users' experience and necessitates prompt attention to mitigate any potential further problems. This playbook helps streamline the troubleshooting process and reduce the time required to identify and resolve issues with HAProxy.
Anomalous frontend 4xx HTTP responses, potentially stemming from HAProxy issues, can disrupt services, diminish user experience, and incur financial losses for businesses. This playbook streamlines the process of resolving HAProxy-related issues that may be contributing to the occurrence of anomalous 4xx HTTP responses.
Such occurrences can lead to various issues including slow performance, inability to index new data, and even system crashes. This playbook is designed to diagnose disk usage, cluster health, and shard status on Elasticsearch instances.
This may lead to slower query response times and diminished system performance. This playbook is tailored to diagnose query load, query performance of Elasticsearch, and assess cluster state.
This could lead to problems with data indexing, querying, and retrieval. This playbook is designed to identify the Elasticsearch version on each node, assess cluster health, and monitor its status.
These issues can significantly impact system performance, resulting in downtime and potential data loss. This playbook is crafted to diagnose cluster health and statistics, index health and statistics, as well as node information and statistics.
This incident can result in system downtime and affect the performance of applications dependent on Elasticsearch. This playbook is designed to obtain information regarding memory usage and allocation to Elasticsearch.
A "container absent" issue occurs when a container that is expected to be running within a containerized environment like Kubernetes or Docker is not present or has unexpectedly stopped. This can lead to various problems such as application unavailability, service interruptions, or unexpected behavior, depending on the criticality of the container and its services. The playbook helps in identifying the root cause, which could be anything from resource constraints to misconfigurations, network issues, or software bugs.
A Docker Image Pull Failure refers to an incident in which the Docker engine is unable to retrieve a particular image from a container registry. This playbook streamlines the troubleshooting process, reduces manual effort, and can expedite the resolution of Docker Image Pull Failures.
When Docker containers encounter a network routing issue, it signifies a breakdown in communication among containers due to network configuration problems. This could arise from improper routing of network traffic between containers or conflicts involving IP addresses or port numbers. This playbook helps in understanding the current state of the Docker network setup and identifying potential issues affecting routing.
This can happen due to various reasons such as incorrect volume paths, permissions issues, or conflicts with existing volumes. As a result of this incident, Docker containers may fail to access the required data or configurations stored in volumes, leading to application errors or malfunctions. The playbook assists in identifying the cause of this and potentially speeds up the resolution time.
A conflict in container names occurs when multiple Docker containers on the same host use identical names. As a result, Docker may fail to start or manage containers properly, leading to service disruptions or unexpected behavior in containerized applications. This playbook helps resolve conflicts in container names by taking a few actions, such as stopping and removing the conflicting containers, or renaming and recreating them.
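For illustration, resolving a name conflict typically comes down to the commands below (the container name is a placeholder; removing a container is destructive, so the playbook would normally confirm it is safe first):
```sh
# Find the container currently holding the conflicting name
docker ps -a --filter "name=<container-name>"
# Either remove the stale container...
docker rm -f <container-name>
# ...or keep it and free the name by renaming it
docker rename <container-name> <container-name>-old
```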
This could lead to data loss in the event of one or more node failures. This playbook is designed to identify the number of nodes in the cluster, the replication factor, and the replication strategy.
This could result in various issues including data inconsistencies, node failures, and performance degradation. The playbook is intended to identify the Cassandra version on all nodes, assess the status of all nodes in the cluster, determine the schema version of the cluster, and evaluate the status of the node's anti-entropy service.
This could potentially affect the availability and performance of the Cassandra service. This playbook is designed to investigate the status of the Cassandra cluster and address any connection issues that may arise.
Slow queries can degrade performance and impact system efficiency. This playbook is aimed at investigating the root cause of slow queries, optimizing database configurations, and refining query performance.
This delay can render the system unresponsive and degrade performance. The playbook is designed to identify various factors contributing to this, such as increased traffic, inefficient queries, or hardware issues.
When the average queue size is excessively high, it may indicate that the database is struggling to handle incoming requests, potentially leading to slower response times and service disruptions. This playbook is tailored to diagnose queue size, the number of connections, and the number of requests per second to address performance issues.
This may result in performance issues and could potentially lead to data loss or downtime. The playbook is designed to diagnose disk usage, I/O operations, and read/write performance to address any underlying issues.
If queries cannot be processed quickly enough, clients may experience timeouts, preventing them from retrieving the necessary data. This playbook is designed to diagnose multiple factors contributing to this issue, such as high load on the cluster, network issues, or hardware problems.
This playbook helps identify the reason for high memory usage and makes the issue easier to troubleshoot.
These steps facilitate quick identification of Nginx failover causes, streamlining troubleshooting for swift issue resolution and enhanced system reliability.