Guide to Creating an Incident Postmortem

In the Elven Platform, the Postmortem process is an essential practice for promoting continuous improvement and a learning culture. After an incident, a detailed analysis examines the root causes, the impact, and the response, and documents the key points.

The goal is to identify opportunities to prevent recurrence and strengthen the resilience of the system. The collaborative and data-driven approach ensures that the lessons learned are shared, contributing to the evolution of the platform and enhancing DevOps/SRE processes.

Accessing the Postmortem Center

  • Access the Incident Management tab in the top menu.

  • Then, click on Postmortems.

Efficient, Guided, and Collaborative Documentation

The Elven Platform offers a dedicated editor for creating Postmortems, designed to make incident documentation simpler, clearer, and more collaborative. With Markdown support, the editor lets teams report events in an organized and readable way, keeping information consistent and making later analysis easier. The result is a more complete view of each event: causes, impacts, actions taken, and lessons learned.

In addition, the editor includes interactive guides, such as the Summary field, which provides step-by-step guidance for filling in the most important information: what happened, when, what the impact was, and the status after resolution. This helps save time and ensures that no relevant data is missed.
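Because the editor supports Markdown, a Summary can also be drafted outside the platform and pasted in. The following is a minimal sketch, assuming a simple bullet layout with one entry per prompt (what happened, when, impact, status after resolution). The class name, field names, and layout are illustrative choices, not part of the Elven Platform.

```python
from dataclasses import dataclass


@dataclass
class IncidentSummary:
    """The four prompts from the Summary guidance (field names are illustrative)."""
    what_happened: str
    when: str
    impact: str
    status_after_resolution: str


def render_summary_markdown(summary: IncidentSummary) -> str:
    """Render the summary as a Markdown section with one bullet per prompt."""
    lines = [
        "## Summary",
        "",
        f"- **What happened:** {summary.what_happened}",
        f"- **When:** {summary.when}",
        f"- **Impact:** {summary.impact}",
        f"- **Status after resolution:** {summary.status_after_resolution}",
    ]
    return "\n".join(lines)


# Values taken from the CPU-load example used in this guide.
example = IncidentSummary(
    what_happened="CPU load rose above 96% in the test environment.",
    when="November 13, 2024",
    impact="Overload that could have degraded service performance.",
    status_after_resolution="CPU usage normalized within 30 minutes.",
)
print(render_summary_markdown(example))
```

Keeping the four prompts as separate fields makes it harder to forget one of them when drafting.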

By bringing everything together in one place, the platform encourages a culture of continuous learning, reinforces best practices, and improves team communication, promoting more effective actions in the future.

Creating a Postmortem

When creating a Postmortem in the Elven Platform, the first step is to choose a clear and descriptive title that directly communicates what happened. Something like "High CPU Load in Production Environment" already provides good context from the start. A precise title makes the document easier to find and understand, both for those involved in the incident and for others who may consult it in the future.

After the title, use the platform’s text editor to fill in the main Postmortem sections. Each one is designed to make the process more fluid and intuitive:

Summary

Briefly explain what happened, including the date and time of the incident, the overall impact, and the final status after resolution. Example: On November 13, 2024, monitoring detected a CPU load increase above 96% in the test environment. This caused an overload, potentially affecting service performance. The team intervened and normalized CPU usage within 30 minutes.

Root Cause

Provide a detailed explanation of the origin of the problem. Include what caused the incident, such as a misconfiguration or poorly scheduled task. Example: The overload was caused by a test job that was mistakenly scheduled to run in parallel with other CPU-intensive tasks, resulting in a temporary saturation of available resources.

Recovery

Describe the exact steps taken to resolve the issue, including any adjustments made and the total recovery time. Example:

  • Stopped the automatic job that caused the overload.

  • Rescheduled the job to avoid overlap.

  • Monitored the system for 30 minutes after the changes to confirm stability.
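The total recovery time mentioned above is the difference between when the incident was detected and when it was resolved. A minimal sketch of that calculation follows. The timestamps are hypothetical, since the example states only the date and the 30-minute duration, and the function names are illustrative.

```python
from datetime import datetime, timedelta


def recovery_duration(detected_at: datetime, resolved_at: datetime) -> timedelta:
    """Return the time elapsed between detection and resolution."""
    if resolved_at < detected_at:
        raise ValueError("resolved_at must not be earlier than detected_at")
    return resolved_at - detected_at


def format_duration(duration: timedelta) -> str:
    """Format a duration as hours and minutes for the Recovery section."""
    total_minutes = int(duration.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Hypothetical timestamps chosen to match the 30-minute recovery in the example.
detected = datetime(2024, 11, 13, 14, 0)
resolved = datetime(2024, 11, 13, 14, 30)
print(format_duration(recovery_duration(detected, resolved)))  # prints "0h 30m"
```

Recording both timestamps in the Postmortem, rather than only the duration, lets later readers check the figure against monitoring data.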

Corrective Actions

List the improvements implemented to prevent the issue from recurring. This section is essential to demonstrate learning and a commitment to continuous improvement. Example:

  • Review job scheduling to minimize simultaneous execution of heavy tasks.

  • Adjust CPU alert thresholds to notify at intermediate loads (e.g., 75% and 85%).

  • Document the incident and conduct team training on best practices.
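The intermediate alert thresholds in the second corrective action can be sketched as a simple classification. This is an illustration only: the 75% and 85% values come from the example above, while the level names and the function are assumptions and do not reflect how alerts are configured in the Elven Platform.

```python
from typing import Optional

# Thresholds from the example corrective action; level names are hypothetical.
# Ordered from highest to lowest so the first match is the most severe level.
THRESHOLDS = [
    (85.0, "high"),
    (75.0, "warning"),
]


def alert_level(cpu_percent: float) -> Optional[str]:
    """Return the alert level for a CPU load, or None if below all thresholds."""
    if not 0.0 <= cpu_percent <= 100.0:
        raise ValueError("cpu_percent must be between 0 and 100")
    for threshold, level in THRESHOLDS:
        if cpu_percent >= threshold:
            return level
    return None


for sample in (60.0, 78.5, 90.0):
    print(sample, alert_level(sample))
```

Notifying at intermediate loads gives the team time to react before CPU usage reaches the level seen in the example incident.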

Associate the Postmortem with a Specific Incident

In the Elven Platform, creating a Postmortem is not just a documentation task but an opportunity for continuous learning and strategy improvement. Linking the Postmortem to a specific incident ties the analysis to that incident's historical record, so the documented causes and actions can be traced back to the original event and provide insights that go beyond the immediate resolution.

Save the Postmortem Form

After documenting each step of the incident and defining preventive strategies, it's time to consolidate what was learned. In the Elven Platform, saving the Postmortem does more than archive a report: it makes the knowledge generated accessible and actionable for the team.

To do this, simply click the SAVE POSTMORTEM button to officially record your analyses, insights, and improvement plans.

Edit, Delete, and Export Your Postmortem

In the Postmortem Center of the Elven Platform, you have full control over the reports created to document and learn from incidents. From the postmortem list, open the actions menu (the three dots in the “Actions” column) to edit a postmortem, delete it, or export it to PDF.

Glossary of Technical Terms

Postmortem: A detailed analysis process conducted after the resolution of an incident, aimed at identifying root causes, impacts, responses taken, and corrective actions. The postmortem seeks to improve processes and prevent the recurrence of issues.

Postmortem Center: A dedicated area within the Elven Platform for creating and viewing postmortems. It allows for documenting and sharing the lessons learned after an incident.

Postmortem Title: A field where the user enters a clear and descriptive title for the postmortem, making it easier to identify the analyzed incident (e.g., “High CPU Load in Production Environment”).

Summary: A section providing a concise summary of the incident, including the date, time, overall impact, and the final status after resolution. It helps contextualize what occurred during the incident.

Root Cause: A detailed explanation of the origin of the incident, identifying the factors that led to the issue. This may include configuration errors, task scheduling failures, or other contributing elements.

Recovery: A description of the exact steps taken to resolve the incident, including system adjustments and the total recovery time. This section outlines the immediate response to the problem.

Corrective Actions: Measures implemented after the incident is resolved to prevent recurrence. These may involve configuration changes, process improvements, or team training.

Save Postmortem: A function that allows storing the postmortem information after completion, ensuring the report is saved and available for future reference.

Edit Postmortem: A function that enables modifying an already saved postmortem, in case updates or additional details are needed after the initial creation.

Delete Postmortem: A function that allows removing a postmortem from the system if the report is no longer needed or requires substantial correction.

Export Postmortem: A function that allows generating a PDF file of the postmortem for distribution or external storage. Exporting facilitates sharing information with teams or stakeholders.
