Responders Insights Management Guide on the Elven Platform
The Insights Responders feature of the Elven Platform provides a complete view of the performance and efficiency of event response teams, helping you understand patterns, response times, and workload. These analyses give you the clarity to take direct action, optimize operations, and ensure your team responds to events quickly and efficiently.
Accessing the Responders Section in the Insights Center
1. Navigate to the main menu and click on Insights.
2. In the submenu, select the Responders item.
Understanding the Metrics
Metrics related to event management are essential for improving operational efficiency and system resilience in complex environments. Total Events, Responder MTTA, Responder MTTR, and Responder Average/Total Response Effort show how effectively teams identify, respond to, and resolve issues. Analyzing them makes it possible to identify bottlenecks in the event response process, optimize resources, and reduce the impact of failures in the production environment. These and other metrics serve as a foundation for informed decision-making and continuous improvement of operations.
Total Events
Total Events is an essential metric that reflects the overall workload faced by responders. Through this metric, it is possible to identify trends over time, such as increases or decreases in the number of events, and plan preventive actions. Additionally, the total number of events serves as a direct indicator of the team's operational demand.
Example: If your team recorded 72 events this month, you can compare this number with previous months. If August, for instance, had only 50 events, this represents a 44% increase ((72 - 50) / 50). This growth may indicate a systemic issue that deserves attention, such as recurring failures or infrastructure changes.
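The comparison in this example is a simple relative change. A minimal Python sketch, assuming only the two monthly counts from the example; the function name `percent_change` is illustrative and not part of the Elven Platform:

```python
def percent_change(previous: int, current: int) -> float:
    """Return the change from `previous` to `current` as a percentage."""
    if previous <= 0:
        raise ValueError("previous must be positive to compute a relative change")
    return (current - previous) / previous * 100


# Figures from the example: 50 events in August, 72 in the month being analyzed.
change = percent_change(previous=50, current=72)
print(f"{change:.0f}% change")  # prints "44% change"
```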
Responder MTTA
Responder MTTA (Mean Time To Acknowledge) indicates the average time required for responders to become aware of the existence of events. In other words, it represents the period between the event detection and the moment when the responders (usually a person or team responsible for resolution) formally acknowledge that the event has been identified.
Example: The Responder MTTA shows 1 day, 10 hours, and 2 minutes for a one-week period, which means that, on average, responders take this long to acknowledge an event after it is detected. This high value suggests a considerable delay in acknowledging events, which can weaken the team's response capability, especially in critical situations.
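To make the definition concrete, the sketch below averages the delay between detection and acknowledgment. It is only an illustration: the dictionary keys `detected_at` and `acknowledged_at`, the sample timestamps, and the choice to skip unacknowledged events are all assumptions, not a description of how the Elven Platform computes the metric.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(events):
    """Average the detection-to-acknowledgment delay over acknowledged events.

    Each event is a dict with the hypothetical keys "detected_at" and
    "acknowledged_at" (a datetime, or None if not yet acknowledged).
    Returns a timedelta, or None when no event has been acknowledged.
    """
    delays = [
        e["acknowledged_at"] - e["detected_at"]
        for e in events
        if e["acknowledged_at"] is not None
    ]
    if not delays:
        return None
    return sum(delays, timedelta()) / len(delays)


# Made-up sample data, for illustration only.
sample_events = [
    {"detected_at": datetime(2024, 9, 2, 8, 0), "acknowledged_at": datetime(2024, 9, 2, 10, 0)},
    {"detected_at": datetime(2024, 9, 3, 14, 0), "acknowledged_at": datetime(2024, 9, 3, 18, 0)},
    {"detected_at": datetime(2024, 9, 4, 9, 0), "acknowledged_at": None},
]
print(mean_time_to_acknowledge(sample_events))  # prints 3:00:00
```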
Responder MTTR
Responder MTTR (Mean Time To Resolve) indicates the average time required for responders to resolve events, considering the period between the event acknowledgment and its complete resolution. It is an important indicator of the team's efficiency in recovering affected systems or services. A high Responder MTTR may indicate challenges in quick resolution, which can impact service continuity.
Example: The Responder MTTR shows 7 hours and 10 minutes for a one-week period, which means that, on average, responders take this long to resolve an event after it has been acknowledged. This value suggests a reasonable resolution capability, but in critical situations a faster resolution may be necessary to avoid greater impacts.
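The same averaging applies to resolution times. The sketch below takes a list of acknowledgment-to-resolution durations and renders the mean in days, hours, and minutes. Both function names and the sample durations are assumptions made for illustration.

```python
from datetime import timedelta


def format_duration(duration: timedelta) -> str:
    """Render a timedelta as days, hours, and minutes, dropping zero parts."""
    total_minutes = int(duration.total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            parts.append(f"{value} {unit}" + ("s" if value != 1 else ""))
    return ", ".join(parts) if parts else "0 minutes"


def mean_time_to_resolve(resolution_times):
    """Average a list of acknowledgment-to-resolution durations (timedelta)."""
    if not resolution_times:
        return None
    return sum(resolution_times, timedelta()) / len(resolution_times)


# Made-up durations chosen so the mean matches the example's 7 hours and 10 minutes.
sample = [timedelta(hours=6), timedelta(hours=8, minutes=20)]
print(format_duration(mean_time_to_resolve(sample)))  # prints "7 hours, 10 minutes"
```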
Responder Average Response Effort
The Responder Average Response Effort represents the average time invested by responders in resolving events and demonstrates the average effort required to keep operations running. This metric helps evaluate the average efficiency of teams and identify opportunities to optimize processes.
Example: Suppose responders invested a total of 10 days and 2 hours (242 hours) in resolving events last month. Dividing this total by the number of events gives the Responder Average Response Effort, which in this case is approximately 3 hours per event. However, one specific case required 5 days of effort, suggesting an isolated issue with disproportionate effort that deserves further analysis.
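One way to surface a disproportionate case like the one in this example is to compare each event's effort against the median, which a single extreme value barely moves. The sketch below is a rough heuristic under stated assumptions: the per-event figures are made up, and the `factor` threshold is arbitrary rather than a platform setting.

```python
from statistics import median


def flag_outliers(effort_hours, factor=5.0):
    """Return the efforts that exceed `factor` times the median effort.

    `factor` is an arbitrary illustrative threshold.
    """
    if not effort_hours:
        return []
    typical = median(effort_hours)
    return [hours for hours in effort_hours if hours > factor * typical]


# Made-up per-event efforts in hours; the 120-hour entry stands for a 5-day case.
efforts = [2.0, 3.0, 2.5, 4.0, 3.0, 120.0]
print(flag_outliers(efforts))  # prints [120.0]
```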
Responder Total Response Effort
The Responder Total Response Effort represents the total time invested by responders in resolving events within a given period. This metric provides a clear view of the teams' aggregated effort to keep operations running. It is useful for measuring operational demand, identifying effort spikes, and guiding decisions on resource allocation.
Example: The Responder Total Response Effort shows that the teams dedicated a total of 120 hours last month to resolve events. Considering a total of 40 events, this results in an average of 3 hours per event. However, one specific case required 40 hours of continuous effort, indicating an atypical event that consumed a significant amount of time. This deviation reinforces the importance of investigating root causes and seeking ways to mitigate recurrence.
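The arithmetic in this example can be sketched as follows. The per-event breakdown is an assumption constructed to match the stated totals (40 events, 120 hours, one 40-hour case); only those three figures come from the example.

```python
def effort_summary(effort_hours):
    """Return total effort, average per event, and the largest event's share."""
    if not effort_hours:
        raise ValueError("at least one event is required")
    total = sum(effort_hours)
    return {
        "total_hours": total,
        "average_hours": total / len(effort_hours),
        "largest_share": max(effort_hours) / total,
    }


# Assumed breakdown: one 40-hour event plus 39 events sharing the other 80 hours.
efforts = [40.0] + [80.0 / 39] * 39
summary = effort_summary(efforts)
print(round(summary["total_hours"], 2))    # prints 120.0
print(round(summary["average_hours"], 2))  # prints 3.0
print(round(summary["largest_share"], 2))  # prints 0.33
```

In this breakdown the single atypical event accounts for about a third of the month's total effort, which is why it stands out against the 3-hour average.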
Combo Events/MTTs/Average Response Effort
The Events/MTTs/Average Response Effort combo provides a comprehensive view of the operational performance of the response teams. By cross-referencing the total number of events with the average response times (MTTA and MTTR) and the average effort invested (Average Response Effort), it is possible to identify bottlenecks, atypical events, and trends of increasing or decreasing workload. This combination of metrics helps in prioritizing continuous improvement actions, properly sizing teams, and improving overall operational efficiency.
Example: If the number of events has increased significantly, but the MTTR and Average Response Effort remain stable, this may indicate good team scalability. On the other hand, a high Average Response Effort with few events may point to complex events or inefficient processes that require attention and review.
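The two readings in this example can be loosely encoded as rules over period-over-period changes. This is only a sketch of the reasoning, not platform behavior: the function name, the use of percentage changes instead of absolute values, and the 10% tolerance for "stable" are all assumptions.

```python
def interpret_combo(events_change_pct, mttr_change_pct, effort_change_pct,
                    tolerance_pct=10.0):
    """Classify a trend from percentage changes versus the previous period.

    `tolerance_pct` is an assumed cutoff for what still counts as "stable".
    """
    def is_stable(pct):
        return abs(pct) <= tolerance_pct

    events_up = events_change_pct > tolerance_pct
    effort_up = effort_change_pct > tolerance_pct

    # More events while MTTR and average effort hold steady.
    if events_up and is_stable(mttr_change_pct) and is_stable(effort_change_pct):
        return "scaling well"
    # Average effort rising although the event count is not.
    if not events_up and effort_up:
        return "review complexity or process"
    return "no clear pattern"


print(interpret_combo(44.0, 2.0, -3.0))  # prints "scaling well"
print(interpret_combo(-5.0, 4.0, 60.0))  # prints "review complexity or process"
```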
Responder Event Volume
The Responder Event Volume represents the distribution of event volume by team or individual responder, showing how the workload is divided. This allows for adjustments in resource allocation or balancing of responsibilities to avoid overload.
Example: In a hypothetical situation, the Responder Event Volume chart shows that the SRE team handled 64 incidents, while the Squad Telemetry team dealt with only 8. This imbalance may indicate that escalation processes are not functioning properly and that it may be necessary to review the event distribution strategy among teams.
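The imbalance in this hypothetical can be quantified as each team's share of the total. A minimal sketch, assuming the input is simply one responder name per event; `volume_share` is an illustrative name:

```python
from collections import Counter


def volume_share(event_responders):
    """Map each responder to (event count, share of all events), largest first."""
    counts = Counter(event_responders)
    total = sum(counts.values())
    return {name: (n, n / total) for name, n in counts.most_common()}


# One entry per event, using the counts from the example.
assignments = ["SRE"] * 64 + ["Squad Telemetry"] * 8
for name, (count, share) in volume_share(assignments).items():
    print(f"{name}: {count} events ({share:.1%})")
# prints:
# SRE: 64 events (88.9%)
# Squad Telemetry: 8 events (11.1%)
```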
Highest MTTA by Responder
Highest MTTA by Responder identifies which teams or individual responders have the highest average acknowledgment time (MTTA, Mean Time To Acknowledge). This metric is essential for understanding potential bottlenecks in the event acknowledgment process and optimizing workflows. Highlighting the responders with the highest times makes it easier to analyze the causes and implement improvements in response efficiency.
Example: If the Highest MTTA by Responder analysis reveals that the SRE team had the highest average acknowledgment time, at 1 day, 10 hours, and 2 minutes, this shows a considerable delay between event detection and proper attention from the team. This high value may indicate operational bottlenecks, such as poorly configured alerts, inadequate prioritization, or low team availability during critical hours. To mitigate this issue, review monitoring processes, resource allocation, and notification workflows in order to increase response agility and reduce operational impact.
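A ranking like this amounts to grouping durations by responder, averaging each group, and taking the maximum. The sketch below illustrates that under stated assumptions: the `(responder, duration)` input shape and the sample delays are made up, with the SRE values chosen so the mean matches the example.

```python
from collections import defaultdict
from datetime import timedelta


def highest_mean_by_responder(records):
    """Return (responder, mean duration) for the responder with the highest mean.

    `records` is an iterable of (responder_name, duration) pairs, where
    duration is a timedelta such as an acknowledgment delay.
    Returns None when there are no records.
    """
    grouped = defaultdict(list)
    for responder, duration in records:
        grouped[responder].append(duration)
    if not grouped:
        return None
    means = {
        name: sum(durations, timedelta()) / len(durations)
        for name, durations in grouped.items()
    }
    worst = max(means, key=means.get)
    return worst, means[worst]


# Made-up acknowledgment delays, for illustration only.
records = [
    ("SRE", timedelta(hours=30)),
    ("SRE", timedelta(hours=38, minutes=4)),
    ("Squad Telemetry", timedelta(minutes=20)),
    ("Squad Telemetry", timedelta(minutes=40)),
]
name, mtta = highest_mean_by_responder(records)
print(name, mtta)  # prints "SRE 1 day, 10:02:00"
```

Passing resolution durations instead of acknowledgment delays would give the equivalent ranking for MTTR.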
Highest MTTR by Responder
Highest MTTR by Responder identifies which teams or individual responders have the highest average resolution time (MTTR – Mean Time to Resolve). This metric is fundamental for understanding potential bottlenecks in the event resolution process and improving operational efficiency. Analyzing MTTR helps identify where the team can improve its ability to restore services quickly, minimizing negative impacts.
Example: If the Highest MTTR by Responder analysis shows that the SRE team has the highest average resolution time, at 7 hours and 10 minutes, this indicates that, among all responders, this team takes the longest to restore services after acknowledging an incident. This value may point to difficulties in the resolution phase, such as manual processes, lack of automation, or specific technical challenges. Although the time is not excessively high, in critical situations such a delay can result in significant business impact, making it necessary to review response and recovery procedures.
Glossary of Technical Terms
Events: Records of occurrences that may affect system functionality. They act as warning signals, helping to identify abnormal behavior. Initially, an event may appear as an alert; when a real impact is confirmed, such as a failure or interruption, it is treated as an incident. Continuously monitoring these events is essential to prevent larger issues and ensure a fast and effective team response.
Insights Center: Central module of the Elven Platform that provides in-depth analysis of operational and business data, supporting strategic decision-making and performance improvement.
Insights Responders: Elven Platform feature that provides a detailed view of the performance and efficiency of incident response teams, facilitating the analysis of metrics, patterns, and workload volumes.
MTTA (Mean Time To Acknowledge): Average time between the detection of an event and its acknowledgment by a responder. Measures the team’s agility in recognizing that a problem has occurred.
MTTR (Mean Time To Resolve): Average time between the acknowledgment of an event and its full resolution. Reflects the efficiency in recovering affected systems or services.
Responder: Person or team responsible for acknowledging, analyzing, and resolving an event or incident in operational systems.
Total Events: Total number of events (alerts/incidents) recorded within a given period. Indicates the operational demand faced by the teams.
Responder MTTA: Average time responders take to acknowledge an event after its detection. May vary between teams or individuals.
Responder MTTR: Average time responders take to resolve an event after it has been acknowledged. Helps identify slow points in the resolution process.
Responder Average Response Effort: Average time invested by responders to resolve events. Helps identify whether incidents are routine or complex.
Responder Total Response Effort: Total time spent by responders resolving all events within a period. Useful for analyzing operational load and effort peaks.
Combo Events/MTTs/Average Response Effort: Combined analysis of the total number of events, average response times (MTTA and MTTR), and average effort. Enables visualization of trends, bottlenecks, and operational efficiency.
Responder Event Volume: Distribution of event volume by responder or team. Helps assess overload, uneven distribution, or inefficiencies in escalation processes.
Highest MTTA by Responder: Indicates the highest average acknowledgment time among responders. Helps identify slow responses and potential bottlenecks in event acknowledgment.
Highest MTTR by Responder: Indicates the highest average resolution time among responders. Reveals who or which team takes the longest to resolve issues, pointing to a need for optimization.
Escalation: Process of redirecting or distributing events/incidents among different responders or support levels, based on severity or area of expertise.
Acknowledgment: Moment when a responder officially becomes aware of an event and takes responsibility for investigating it.
Resolution: Moment when the incident is considered fully resolved, restoring the functionality of the affected system or service.
Effort: Amount of time and energy dedicated by a responder or team to analyze, mitigate, and resolve an event.