Alerts Insights Management Guide on the Elven Platform
This guide explains how to handle alerts using the metrics and features available in the Elven Platform. Responding to alerts efficiently requires well-structured data, continuous analysis, and adjustments to operational processes. The metrics described here are presented together in the platform, giving teams the tools to improve alert response and increase system stability.
Accessing the Insights Center in the Alerts Section
Navigate to the main menu and click on Insights.
In the submenu, select the Alerts item.
Understanding the Metrics
The Alerts tab of the Elven Platform brings together the key metrics for alert management, giving a clear and complete view of the alerts. Total Alerts is highlighted, showing at a glance the volume of alerts related to system operations.
For detailed monitoring of alerts, the MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) metrics are clearly displayed, enabling the team to track the time required to acknowledge and resolve alerts. Alongside these metrics, the Average/Total Response Effort is presented, indicating the time needed to resolve alerts and providing important insights into team efficiency.
Additionally, Combo Events/MTTs/Average Response Effort offers a consolidated view, making it easier to analyze the total impact of alerts in terms of response time and effort.
The platform also provides information on the Alerts Volume per Day, allowing the identification of occurrence patterns and prioritization of actions. The Acknowledgment Rate and Postmortem Rate are key metrics to ensure quick response and continuous learning from past alerts. The distribution of alerts throughout the day is detailed through the Alerts Time Cluster, highlighting critical moments in the system, such as peak periods during or outside business hours.
The analysis of the Time Cluster Distribution per Month provides a clear view of how alerts behave over time, enabling strategic adjustments in resources and monitoring. The Alerts Day of Week metric helps identify days with higher incidence, supporting resource planning and mitigation strategies. The platform also allows analysis of the Alert Origin, with Alerts per Origin highlighting specific areas that need attention, such as systems or external integrations.
Finally, Alerts Hour Interval offers a precise breakdown of alert behavior throughout the day, helping to identify peaks and optimize effort allocation. With this information organized in a clear and interactive way, the platform enables agile and efficient alert management.
Total Alerts
To understand the real impact of the alerts on the system, it is important to monitor the total number of alerts recorded. This data helps reveal behavioral patterns and identify activity spikes that may signal something unusual. Performing regular analyses is a good practice, as it allows the detection of unexpected increases in the volume of alerts, which may indicate anything from hidden issues to overload in parts of the system.
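As an illustration of the regular analysis described above, the following minimal sketch flags days whose alert count is unusually high compared with the preceding week. It is not part of the Elven Platform: the function name find_spikes, the 7-day window, the threshold of two standard deviations, and the sample counts are all assumptions chosen for the example.

```python
from statistics import mean, pstdev

def find_spikes(daily_counts, window=7, threshold=2.0):
    """Return indexes of days whose count exceeds the trailing mean
    by more than `threshold` standard deviations (hypothetical rule)."""
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]          # the previous `window` days
        limit = mean(history) + threshold * pstdev(history)
        if daily_counts[i] > limit:
            spikes.append(i)
    return spikes

# Made-up daily totals, e.g. exported by hand from the Total Alerts view.
daily_counts = [10, 12, 11, 9, 10, 11, 12, 40, 11]
print(find_spikes(daily_counts))  # [7]
```

A rule of this kind only points to days worth a closer look; whether a spike reflects a hidden issue or an overloaded component still requires investigation.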
MTTA/MTTR
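The Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR) are two essential metrics for measuring the team's efficiency in alert management. MTTA is the average time the team takes to acknowledge an alert after it is recorded, and MTTR is the average time required to resolve an alert after it has been acknowledged. If the MTTA or MTTR is high, consider investing in training or adjusting the workflow, for example by increasing automation in initial responses.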
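The two averages can be sketched as follows, using the glossary definitions in this guide (MTTA from recording to acknowledgment, MTTR from acknowledgment to resolution). The record fields created, acknowledged, and resolved are hypothetical names for illustration, not the platform's data schema, and the timestamps are made up.

```python
from datetime import datetime
from statistics import mean

# Hypothetical alert records; field names are assumptions.
alerts = [
    {"created": datetime(2024, 5, 6, 9, 0),
     "acknowledged": datetime(2024, 5, 6, 9, 4),
     "resolved": datetime(2024, 5, 6, 9, 34)},
    {"created": datetime(2024, 5, 6, 23, 10),
     "acknowledged": datetime(2024, 5, 6, 23, 20),
     "resolved": datetime(2024, 5, 7, 0, 20)},
]

def mtta_minutes(records):
    """Mean minutes between an alert being recorded and acknowledged."""
    return mean((r["acknowledged"] - r["created"]).total_seconds() / 60
                for r in records)

def mttr_minutes(records):
    """Mean minutes between acknowledgment and resolution."""
    return mean((r["resolved"] - r["acknowledged"]).total_seconds() / 60
                for r in records)

print(mtta_minutes(alerts))  # 7.0
print(mttr_minutes(alerts))  # 45.0
```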
Average/Total Response Effort
Tracking the Average and Total Response Effort is essential to understand how much time the team spends responding to alerts, both per alert on average and in total. This metric shows not only how quickly the team handles alerts, but also points to possible bottlenecks or steps that take longer than they should. By analyzing the time required for resolution, it is possible to gain valuable insights into operational efficiency. With this data in hand, it becomes easier to make decisions that optimize workflows, improve resource allocation, and, most importantly, ensure that alerts are handled with the speed the business demands.
Combo Events/MTTs/Average Response Effort
To have a complete view of how alerts are being handled, it is essential to combine different metrics, such as Total Alerts, MTTs, and Average Response Effort. This combination of data provides a richer overview of response efficiency and helps understand the real impact of each alert on daily operations. By cross-referencing this information, it is possible to identify patterns in alerts that require more time or effort to resolve. This makes it easier to detect bottlenecks, whether in processes, teams, or specific technologies. With these insights, you can act more strategically to optimize workflows, improve tools, and reduce response time, ensuring a more stable environment and a more productive team.
Acknowledgment Rate and Postmortem Rate
The Acknowledgment Rate and Postmortem Rate are powerful allies for evaluating how the team is handling alerts on a daily basis. The Acknowledgment Rate measures the percentage of alerts that were quickly acknowledged, while the Postmortem Rate shows how many of these alerts resulted in documented learnings. Monitoring these indicators helps ensure not only a fast response, but also the continuous growth of the team. Setting clear goals to improve these rates is essential. A low acknowledgment rate may signal overload, lack of prioritization, or even failures in internal communication. On the other hand, a low postmortem rate may indicate that alerts are being resolved without generating learning, which prevents real long-term improvements. The idea is to turn each alert into an opportunity for evolution.
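Both rates are percentages over the same set of alerts, which the short sketch below illustrates. The field names acknowledged and has_postmortem are hypothetical, and because this guide does not state what time limit counts as a quick acknowledgment, the example simply counts every acknowledged alert; that simplification is an assumption.

```python
def rate(records, flag):
    """Percentage of records where the boolean field `flag` is true."""
    if not records:
        return 0.0
    return 100 * sum(1 for r in records if r[flag]) / len(records)

# Made-up alerts; field names are assumptions for illustration.
alerts = [
    {"acknowledged": True,  "has_postmortem": True},
    {"acknowledged": True,  "has_postmortem": False},
    {"acknowledged": True,  "has_postmortem": False},
    {"acknowledged": False, "has_postmortem": False},
]

print(rate(alerts, "acknowledged"))    # 75.0
print(rate(alerts, "has_postmortem"))  # 25.0
```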
Alerts Hour Interval and Time Cluster Distribution per Month
The Alerts Hour Interval and Time Cluster Distribution per Month metrics are valuable allies when it comes to understanding when alerts occur most frequently, whether during specific hours of the day or particular periods of the month. Having this clear temporal view is essential for anticipating risks, better planning the team’s actions, and ensuring a faster and more strategic response during the most critical moments.
To support this analysis, time has been segmented into three well-defined ranges:
Sleep Hour (nighttime): Every day of the week, including weekends and holidays, from 10:00 p.m. to 8:00 a.m.
Business Hour (working hours): Monday to Friday, from 8:00 a.m. to 6:00 p.m.
Off Hour (outside working hours):
Monday to Friday, from 6:00 p.m. to 10:00 p.m.
Weekends, from 8:00 a.m. to 10:00 p.m.
With this classification, it becomes much easier to identify patterns and act proactively, ensuring that the right resources are available at the right times.
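The three ranges above can be expressed as a small classification function. This is a sketch under stated assumptions, not the platform's implementation: the function name time_cluster is hypothetical, each range is treated as including its start time and excluding its end time (so 8:00 a.m. on a weekday counts as Business Hour), and daytime hours on holidays are treated like any other day because this guide only mentions holidays for Sleep Hour.

```python
from datetime import datetime

def time_cluster(ts: datetime) -> str:
    """Classify a timestamp into Sleep Hour, Business Hour, or Off Hour."""
    hour = ts.hour
    # Sleep Hour: every day, 10:00 p.m. to 8:00 a.m.
    if hour >= 22 or hour < 8:
        return "Sleep Hour"
    # Business Hour: Monday (0) to Friday (4), 8:00 a.m. to 6:00 p.m.
    if ts.weekday() < 5 and hour < 18:
        return "Business Hour"
    # Off Hour: weekday evenings (6:00 p.m. to 10:00 p.m.)
    # and weekends (8:00 a.m. to 10:00 p.m.).
    return "Off Hour"

print(time_cluster(datetime(2024, 5, 6, 9, 0)))    # Business Hour (a Monday)
print(time_cluster(datetime(2024, 5, 6, 19, 0)))   # Off Hour
print(time_cluster(datetime(2024, 5, 11, 10, 0)))  # Off Hour (a Saturday)
print(time_cluster(datetime(2024, 5, 6, 23, 0)))   # Sleep Hour
```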
Thus, if there are alert spikes at recurring times, it is worth investigating whether these windows coincide with higher operation volumes, frequent deploys, or even elevated system loads. With this data in hand, it is possible to assess whether the infrastructure needs reinforcement, whether processes can be optimized, or whether it is necessary to reschedule maintenance or support hours. Small adjustments in this direction can make a big difference in the stability of the environment.
Alerts Time Cluster and Alerts Day of Week
The analysis of alerts by Hour of Day and Day of Week provides a strategic view of how these events are distributed over time. This perspective allows for clearer identification of recurring patterns, such as more critical hours or days, helping the team anticipate potential risks. As a result, it becomes possible to reinforce operations at the right moments, improve response capacity, and apply more effective preventive actions.
To make this reading even more intuitive, we use the same time division described in the previous section (Sleep Hour, Business Hour, and Off Hour), which ensures consistency in the analysis and supports data-driven decision-making.
For example, if there is an increase in alerts on Monday, this may indicate a natural overload after the weekend, whether due to task accumulation, service restarts, or increased system usage. In this case, it may be worth considering team reinforcement or a process review on that day. The goal is to anticipate problems, ensuring that the right resources are available when they are most needed.
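A day-of-week breakdown like the one described here amounts to counting alerts per weekday. The sketch below shows the idea with made-up timestamps; the variable names are hypothetical and the data does not come from the platform.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Made-up alert timestamps for illustration.
timestamps = [
    datetime(2024, 5, 6, 9, 15),    # Monday
    datetime(2024, 5, 13, 10, 40),  # Monday
    datetime(2024, 5, 8, 14, 5),    # Wednesday
    datetime(2024, 5, 11, 21, 30),  # Saturday
]

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
counts = Counter(DAYS[ts.weekday()] for ts in timestamps)
busiest_day, busiest_count = counts.most_common(1)[0]

print(busiest_day, busiest_count)  # Monday 2
```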
Alerts per Origin
Understanding the Origin of alerts is essential to identify where the problems are truly coming from. This visibility allows mapping failures in specific systems, such as APIs, integrations with external platforms, or critical parts of the internal infrastructure. By clearly knowing the origin, teams can act with greater precision and agility, focusing on what really needs to be fixed.
If a specific origin, such as API-Auth or API-Report, is consistently associated with alerts, this is a clear sign that the area needs attention. In such cases, it is possible to focus efforts on improving code quality, performing more automated testing, adjusting integration processes, or even rethinking the solution architecture. In addition to reducing failures, this also increases confidence in the systems and the teams that maintain them.
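To see which origins dominate, alerts can be tallied by origin and ranked. The sketch below uses the API-Auth and API-Report names mentioned above; Billing-Webhook is a made-up origin added for contrast, and the list itself is sample data rather than platform output.

```python
from collections import Counter

# Sample origins, one entry per alert. "Billing-Webhook" is hypothetical.
origins = ["API-Auth", "API-Report", "API-Auth", "Billing-Webhook", "API-Auth"]

per_origin = Counter(origins)

# Rank origins from most to least frequent.
for origin, count in per_origin.most_common():
    share = 100 * count / len(origins)
    print(f"{origin}: {count} alerts ({share:.0f}%)")

top_origin, top_count = per_origin.most_common(1)[0]
```

In this sample, API-Auth accounts for three of the five alerts, which is the kind of concentration that would justify a closer look at that area.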
Glossary of Technical Terms
Alerts: Initial signals issued by the system to indicate possible abnormal behavior or an anomaly. Although not every alert represents a real problem, each one serves as an early warning that requires attention and analysis. When an alert is confirmed to be affecting system operation, it may escalate into an incident. Effective monitoring and triage of alerts are essential to ensure a proactive team response.
Insights Center: Central module of the Elven Platform that provides in-depth analysis of operational and business data, supporting strategic decision-making and performance improvement.
Total Alerts: Metric that indicates the total number of alerts recorded over a period, providing an overview of their influence on the systems.
Alerts Volume per Day: Metric that shows the daily distribution of alerts, allowing the identification of peaks and patterns over time.
Average/Total Response Effort: Indicator that measures the time spent resolving alerts, helping assess team efficiency.
MTTA: Average time the team takes to acknowledge an alert after it is recorded.
MTTR: Average time required to resolve an alert after it has been acknowledged.
Acknowledgment Rate: Percentage of alerts that were quickly acknowledged, indicating the team's effectiveness in the first response.
Postmortem Rate: Percentage of alerts that underwent analysis, aiming at learning and preventing recurrence.
Alerts Time Cluster: Grouping of alerts based on the time they occur, allowing identification of critical impact periods.
Time Cluster Distribution per Month: Metric that organizes alerts over the months, making it easier to identify seasonal trends.
Alerts Day of Week: Metric that distributes alerts by days of the week, enabling resource planning and strategic adjustments.
Alerts per Origin: Classification of alerts based on their sources, such as internal systems, APIs, or external integrations.
Alerts Hour Interval: Distribution of alerts in time intervals throughout the day, allowing identification of activity peaks.
Combo Events/MTTs/Average Response Effort: Consolidated view that combines response time, resolution, and effort metrics, providing a unified analysis of the influence of alerts on systems.