Incidents Insights Management Guide on the Elven Platform

This guide aims to provide a clear and detailed view on how to handle incidents, using the metrics and features available in the Elven Platform. Efficiently managing and responding to incidents requires a combination of well-structured data, continuous analysis, and adjustments to operational processes. By using the metrics presented in an integrated way, the Elven Platform offers the necessary tools to improve incident response and increase system stability in an intelligent and efficient manner.

Accessing the Incidents section of the Insights Center

  • Navigate to the main menu and click on Insights.

  • In the submenu, select the item Incidents.

Understanding the Metrics

In the Incidents tab of the Elven Platform, we work with crucial metrics to optimize incident management and improve team response. The information is presented in an intuitive and accessible way, providing a clear and complete view of the incidents. The Total Incidents metric is highlighted, giving a quick view of how many incidents impacted the systems.

For detailed response tracking, the MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) metrics are clearly displayed, enabling the team to monitor the time required to acknowledge and resolve incidents. Alongside these metrics, the Average/Total Response Effort indicates the time spent resolving incidents, offering important insights into team efficiency.

Additionally, the Combo Events/MTTs/Average Response Effort view consolidates these metrics, making it easier to analyze the total impact of incidents in terms of time and response effort.

The platform also provides information on Incidents Volume per Day, allowing teams to identify occurrence patterns and prioritize actions. The Acknowledgment Rate and Postmortem Rate are key metrics to ensure quick response and continuous learning from past incidents. The distribution of incidents throughout the day is detailed through Incidents Time Cluster, highlighting critical impact moments such as peak or off-hours.

The analysis of Time Cluster Distribution per Month offers a clear view of how incidents behave over time, enabling strategic adjustments in resources and monitoring. The Incidents Day of Week metric helps identify days with higher incident rates, supporting resource planning and mitigation strategies. The platform also allows analysis of incident origin, with Incidents per Origin highlighting specific areas that require attention, such as systems or external integrations.

Finally, the visualization of Incidents Hour Interval provides a precise breakdown of incident behavior throughout the day, helping identify peaks and optimize effort allocation. With this information organized in a clear and interactive way, the platform enables agile and efficient incident management, ensuring a smarter and more responsive workflow.

Now let’s dive deeper and share some tips on each of these features that optimize incident management on the Elven Platform.

Total Incidents

To understand the real impact of incidents on the system, it is important to monitor the total number of recorded incidents. This data helps reveal behavior patterns and identify activity spikes that may signal something unusual. Performing periodic analyses is a good practice, as it allows the detection of unexpected increases in incident volume, which may indicate anything from hidden issues to system overload in specific areas.
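As an illustration of the periodic analysis described above, the sketch below flags days whose incident count is unusually high compared with the rest of the period. It assumes daily counts have been copied out of the platform by hand; the `find_spikes` function, the sample data, and the "mean plus two standard deviations" rule are all hypothetical choices, not something the Elven Platform prescribes.

```python
from statistics import mean, pstdev


def find_spikes(daily_counts, threshold=2.0):
    """Return the days whose count exceeds mean + threshold * standard deviation."""
    values = list(daily_counts.values())
    limit = mean(values) + threshold * pstdev(values)
    return [day for day, count in daily_counts.items() if count > limit]


# Hypothetical daily incident counts (day -> number of incidents).
daily_counts = {
    "2024-03-01": 4,
    "2024-03-02": 5,
    "2024-03-03": 3,
    "2024-03-04": 4,
    "2024-03-05": 18,
    "2024-03-06": 5,
}

print(find_spikes(daily_counts))  # ['2024-03-05']
```

A lower `threshold` makes the check more sensitive; the right value depends on how noisy your incident volume normally is.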

MTTA/MTTR

Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR) are two essential metrics for measuring the efficiency of the team in responding to incidents. If MTTA or MTTR is high, consider investing in training or making adjustments to workflow, such as increasing automation in initial responses.
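To make the two averages concrete, here is a minimal sketch of how they could be computed from incident timestamps. It follows the definitions in this guide's glossary (MTTA from recording to acknowledgment, MTTR from acknowledgment to resolution); the field names `created`, `acknowledged`, and `resolved` and the sample data are assumptions made for illustration, not the platform's data model.

```python
from datetime import datetime, timedelta


def mean_delta(deltas):
    """Average a sequence of timedelta values."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


# Hypothetical incidents with the three timestamps needed for both metrics.
incidents = [
    {
        "created": datetime(2024, 3, 1, 10, 0),
        "acknowledged": datetime(2024, 3, 1, 10, 5),
        "resolved": datetime(2024, 3, 1, 10, 45),
    },
    {
        "created": datetime(2024, 3, 2, 22, 0),
        "acknowledged": datetime(2024, 3, 2, 22, 15),
        "resolved": datetime(2024, 3, 2, 23, 15),
    },
]

# MTTA: average time between recording and acknowledgment.
mtta = mean_delta(i["acknowledged"] - i["created"] for i in incidents)
# MTTR: average time between acknowledgment and resolution.
mttr = mean_delta(i["resolved"] - i["acknowledged"] for i in incidents)

print(mtta)  # 0:10:00
print(mttr)  # 0:50:00
```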

Average/Total Response Effort

Monitoring the Average and Total Response Effort is essential to understand how much time, on average, the team takes to resolve incidents. This metric reflects not only the team's agility in handling incidents but also highlights potential bottlenecks or steps that are consuming more time than ideal.

By analyzing the time required for resolution, it’s possible to gain valuable insights into operational efficiency. With this data in hand, it becomes easier to make decisions that optimize workflows, improve resource allocation, and most importantly, ensure that incidents are addressed with the speed the business demands.

Combo Events/MTTs/Average Response Effort

To get a complete view of how incidents are being handled, it’s essential to combine different metrics, such as Total Incidents, MTTs, and Average Response Effort. This combination of data provides a richer overview of response efficiency and helps understand the real impact of each incident on daily operations.

By cross-referencing this information, it’s possible to identify patterns in incidents that require more time or effort to resolve. This makes it easier to detect bottlenecks, whether in processes, teams, or specific technologies. With these insights, you can take more targeted action to optimize workflows, improve tools, and reduce response time, ensuring a more stable environment and a more productive team.
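The cross-referencing idea can be sketched as a small grouping step: for each group of incidents, compute the count, the total effort, and the average effort side by side. The `summarize` helper, the `effort_minutes` field, and the sample values below are hypothetical and only illustrate the kind of comparison the combined view supports.

```python
from collections import defaultdict


def summarize(incidents, key):
    """Group incidents by `key` and report count, total and average effort."""
    groups = defaultdict(list)
    for incident in incidents:
        groups[incident[key]].append(incident["effort_minutes"])
    return {
        name: {
            "count": len(efforts),
            "total_effort": sum(efforts),
            "average_effort": sum(efforts) / len(efforts),
        }
        for name, efforts in groups.items()
    }


# Hypothetical incidents with an assumed effort field, in minutes.
incidents = [
    {"origin": "API-Auth", "effort_minutes": 30},
    {"origin": "API-Auth", "effort_minutes": 90},
    {"origin": "API-Report", "effort_minutes": 20},
]

summary = summarize(incidents, "origin")
print(summary["API-Auth"])
# {'count': 2, 'total_effort': 120, 'average_effort': 60.0}
```

A group with few incidents but a high average effort points to a different kind of bottleneck than a group with many quick incidents, which is why looking at the numbers together is more informative than any single metric.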

Acknowledgment Rate and Postmortem Rate

The Acknowledgment Rate and Postmortem Rate metrics are powerful allies for evaluating how the team is handling incidents on a daily basis. The Acknowledgment Rate measures the percentage of incidents that were quickly acknowledged, while the Postmortem Rate shows the percentage that underwent a postmortem analysis and resulted in documented learnings. Monitoring these indicators helps ensure not only a fast response, but also the team’s continuous growth.

Setting clear goals to improve these rates is essential. A low acknowledgment rate may signal overload, lack of prioritization, or even internal communication issues. On the other hand, a low postmortem rate may indicate that incidents are being resolved but without generating learning, which prevents real long-term improvements. The idea is to turn every incident into an opportunity for evolution.
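When setting goals, it can help to see how such percentages are derived. The sketch below is only an assumption about the arithmetic: the `rate` helper and the boolean fields `acknowledged` and `postmortem` are hypothetical, and this guide does not specify what time limit the platform uses to count an incident as quickly acknowledged.

```python
def rate(incidents, field):
    """Percentage of incidents for which `field` is true."""
    if not incidents:
        return 0.0
    matching = sum(1 for incident in incidents if incident.get(field))
    return 100.0 * matching / len(incidents)


# Hypothetical incidents; the flags are assumed to be already known.
incidents = [
    {"id": 1, "acknowledged": True, "postmortem": True},
    {"id": 2, "acknowledged": True, "postmortem": False},
    {"id": 3, "acknowledged": False, "postmortem": False},
    {"id": 4, "acknowledged": True, "postmortem": False},
]

print(rate(incidents, "acknowledged"))  # 75.0
print(rate(incidents, "postmortem"))    # 25.0
```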

Incidents Hour Interval and Time Cluster Distribution per Month

The Incidents Hour Interval and Time Cluster Distribution per Month metrics are valuable allies when it comes to understanding when incidents occur most frequently, whether during specific hours of the day or particular periods of the month. Having this clear temporal view is essential for anticipating risks, better planning team operations, and ensuring a faster and more strategic response during high-impact moments.

To support this analysis, time has been segmented into three well-defined ranges:

  • Sleep Hour (nighttime): Every day of the week, including weekends and holidays, from 10:00 p.m. to 8:00 a.m.

  • Business Hour (working hours): Monday to Friday, from 8:00 a.m. to 6:00 p.m.

  • Off Hour (outside working hours):

    • Monday to Friday, from 6:00 p.m. to 10:00 p.m.

    • Weekends, from 8:00 a.m. to 10:00 p.m.

With this classification, it's much easier to identify patterns and act proactively, ensuring that the right resources are available at the right time.
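The three ranges above can be expressed as a small classification function. This is a sketch rather than the platform's implementation: the `time_cluster` name is hypothetical, range starts are assumed to be inclusive and range ends exclusive, and holidays are not handled because the daytime range for holidays is not stated here.

```python
from datetime import datetime


def time_cluster(moment):
    """Classify a datetime as Sleep Hour, Business Hour or Off Hour."""
    hour = moment.hour
    # Sleep Hour: every day, from 10:00 p.m. to 8:00 a.m.
    if hour >= 22 or hour < 8:
        return "Sleep Hour"
    # Business Hour: Monday (0) to Friday (4), from 8:00 a.m. to 6:00 p.m.
    if moment.weekday() < 5 and hour < 18:
        return "Business Hour"
    # Off Hour: weekday evenings and weekend daytime.
    return "Off Hour"


print(time_cluster(datetime(2024, 3, 4, 9, 0)))   # Monday morning: Business Hour
print(time_cluster(datetime(2024, 3, 4, 19, 0)))  # Monday evening: Off Hour
print(time_cluster(datetime(2024, 3, 2, 9, 0)))   # Saturday morning: Off Hour
print(time_cluster(datetime(2024, 3, 3, 23, 0)))  # Sunday night: Sleep Hour
```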

Thus, if there are incident peaks at recurring hours, it is worth investigating whether these windows coincide with higher operation volumes, frequent deployments, or even elevated system loads. With this data in hand, it's possible to assess whether the infrastructure needs reinforcement, if processes can be optimized, or if it's necessary to reschedule maintenance or support hours. Small adjustments in this direction can make a significant difference in the stability of the environment.

Incidents Time Cluster and Incidents Day of Week

The analysis of incidents by Time of Day and Day of the Week provides a strategic view of how these events are distributed over time. This perspective allows for clearer identification of recurring patterns, such as more critical hours or days, helping the team anticipate potential risks. As a result, it becomes possible to reinforce operations at the right moments, improve response capacity, and apply more effective preventive actions.

To make this reading even more intuitive, we use the same time segmentation mentioned in the previous item, ensuring consistency in the analysis and facilitating data-driven decision-making.

For example, if there is an increase in incidents on Monday, this may indicate a natural overload after the weekend, whether due to task accumulation, service restarts, or increased system usage. In this case, it may be worth considering team reinforcement or a process review on that day. The goal is to anticipate problems, ensuring the right resources are available when they are most needed.
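A day-of-week breakdown like the Monday example can be reproduced with a simple count. The sketch below assumes you have a list of incident dates; the `incidents_by_weekday` helper and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date

WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def incidents_by_weekday(dates):
    """Count incidents per weekday, keeping days with zero incidents."""
    counts = Counter(WEEKDAYS[d.weekday()] for d in dates)
    return {day: counts.get(day, 0) for day in WEEKDAYS}


# Hypothetical incident dates.
dates = [
    date(2024, 3, 4),   # Monday
    date(2024, 3, 4),   # Monday
    date(2024, 3, 5),   # Tuesday
    date(2024, 3, 9),   # Saturday
    date(2024, 3, 11),  # Monday
]

by_day = incidents_by_weekday(dates)
busiest = max(by_day, key=by_day.get)
print(busiest, by_day[busiest])  # Monday 3
```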

Incidents per Origin

Understanding the origin of incidents is essential to identify where problems are truly coming from. This visibility allows teams to map failures in specific systems, such as APIs, integrations with external platforms, or critical parts of the internal infrastructure. By clearly knowing the origin, teams can act with greater precision and agility, focusing on what truly needs to be fixed.

If a specific origin, such as API-Auth or API-Report, is consistently related to incidents, this is a clear sign that the area needs attention. In such cases, it is possible to concentrate efforts on:

  • Improving code quality,

  • Increasing automated testing,

  • Adjusting integration processes,

  • Rethinking the architecture of the solution.

By doing so, not only are failures reduced, but confidence in the systems and in the teams that maintain them is also strengthened.
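To decide where to concentrate those efforts, a ranking of origins by incident count is often enough. The sketch below is illustrative only: `top_origins` is a hypothetical helper, and `Database` is a made-up origin added next to the API-Auth and API-Report examples used in this guide.

```python
from collections import Counter


def top_origins(origins, limit=3):
    """Return the `limit` most frequent origins with their incident counts."""
    return Counter(origins).most_common(limit)


# Hypothetical origin recorded for each incident.
origins = [
    "API-Auth", "API-Report", "API-Auth",
    "Database", "API-Auth", "API-Report",
]

print(top_origins(origins, limit=2))
# [('API-Auth', 3), ('API-Report', 2)]
```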

Glossary of Technical Terms

Incidents: Events that have a real impact, such as a failure or interruption. Continuously tracking incidents is essential to prevent larger issues and ensure a quick and effective response from the team.

Insights Center: Core module of the Elven Platform that provides in-depth analysis of operational and business data, supporting strategic decision-making and performance improvement.

Total Incidents: Metric that indicates the total number of incidents recorded over a period, providing a view of the magnitude of impactful events.

Incidents Volume per Day: Metric that shows the daily distribution of incidents, allowing identification of peaks and patterns over time.

Average/Total Response Effort: Indicator that measures the time spent resolving incidents, helping assess team efficiency.

MTTA: Average time the team takes to acknowledge an incident after it is recorded.

MTTR: Average time required to resolve an incident after it has been acknowledged.

Acknowledgment Rate: Percentage of incidents that were quickly acknowledged, indicating the team's effectiveness in initial response.

Postmortem Rate: Percentage of incidents that underwent an analysis, aimed at learning and preventing recurrence.

Incidents Time Cluster: Grouping of incidents based on the time they occur, allowing identification of critical impact periods.

Time Cluster Distribution per Month: Metric that organizes incidents across months, helping identify seasonal trends.

Incidents Day of Week: Metric that distributes incidents by day of the week, enabling resource planning and strategic adjustments.

Incidents per Origin: Classification of incidents based on their source, such as internal systems, APIs, or external integrations.

Incidents Hour Interval: Distribution of incidents across time intervals throughout the day, allowing identification of activity peaks.

Combo Events/MTTs/Average Response Effort: Consolidated view that combines incident counts, acknowledgment and resolution times (MTTA and MTTR), and response effort, providing a unified analysis of incident impact.
