Home Page Guide
This page presents the main operational performance indicators of the Elven Platform, giving you a comprehensive overview of what is working well and what can be improved. The dashboard was designed to make the impact of each metric easy to understand and to support faster, more accurate decision-making.
Analyzed together, these metrics offer a clear and complete view of system performance, so you can act quickly and precisely to keep everything running as efficiently as possible. Below, we explore each indicator with practical examples, showing how to apply it in your daily routine in a simple, intuitive way.
It’s worth noting that all these indicators are available in the Insights section, where you’ll also find detailed documentation that dives deeper into each metric, helping you understand their impact on system operations.
Uptime
Uptime shows the percentage of time the system was available and running without interruptions. The closer to 100%, the better the user experience.
Example: Imagine an e-commerce platform during Black Friday. An uptime of 99.9% means the site was functional almost the entire time, ensuring sales and customer satisfaction. However, even 0.1% of downtime can mean thousands in lost revenue.
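As a rough illustration, the relationship between an uptime percentage and the downtime it allows can be computed directly. A minimal sketch, assuming a 24-hour monitoring window:

```python
# Hypothetical sketch: how much downtime does 99.9% uptime allow?
window_hours = 24          # assumed monitoring window (e.g., Black Friday)
uptime_pct = 99.9

downtime_minutes = window_hours * 60 * (1 - uptime_pct / 100)
print(f"{uptime_pct}% uptime over {window_hours}h allows "
      f"~{downtime_minutes:.1f} minutes of downtime")
# -> ~1.4 minutes
```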
Downtime
This indicator, Downtime, shows the total time the system was unavailable during a specific period. It helps identify when critical issues occurred.
Example: Suppose a customer support application was down for 1 hour. This could lead to longer queues, customer frustration, and a direct impact on the company’s reputation.
Outages
Here, in Outages, we measure how many times the system experienced interruptions, even if they were brief. Knowing the frequency of failures is essential to understand whether there is a recurring pattern that needs to be addressed.
Example: A streaming service that experiences 10 interruptions in a single day may be worse for the customer than one longer outage.
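Both Downtime and Outages can be derived from the same outage log. A minimal sketch, assuming each outage is recorded as a (start, end) pair of timestamps (the data below is invented for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical outage log for one day: (start, end) timestamp pairs
outages = [
    (datetime(2024, 11, 29, 10, 0), datetime(2024, 11, 29, 10, 5)),
    (datetime(2024, 11, 29, 14, 30), datetime(2024, 11, 29, 14, 32)),
]

outage_count = len(outages)  # the Outages metric
downtime = sum((end - start for start, end in outages), timedelta())  # the Downtime metric
print(f"{outage_count} outages, total downtime: {downtime}")
```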
Latency (Max, Min, and Average per Hour)
These indicators assess the system’s response speed, measuring the maximum, minimum, and average latency in milliseconds. Low latency means everything is running smoothly.
Example: In a delivery app, high latency can lead to delays for the customer when trying to view the menu or complete an order. If the average latency is around 300ms, it might be time to optimize performance.
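A minimal sketch of how max, min, and average latency per hour might be aggregated from raw samples (the sample data is invented for illustration):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (hour_of_day, latency_ms)
samples = [(14, 120), (14, 310), (14, 95), (15, 210), (15, 180)]

by_hour = defaultdict(list)
for hour, latency_ms in samples:
    by_hour[hour].append(latency_ms)

for hour, values in sorted(by_hour.items()):
    print(f"{hour:02d}h  max={max(values)}ms  min={min(values)}ms  "
          f"avg={mean(values):.0f}ms")
```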
Total Incidents
This metric shows how many incidents were recorded during the analyzed period. Understanding the volume helps the team prioritize solutions and optimize processes.
Example: If the dashboard shows that there were 20 incidents in the past week, it’s time to identify whether they were caused by the same issue or if it’s necessary to review the infrastructure.
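To check whether a week's incidents share a common cause, you could group them by root cause, roughly like this (the incident list and cause labels are invented):

```python
from collections import Counter

# Hypothetical incidents from the past week, tagged with a root cause
incidents = ["disk full", "disk full", "network", "disk full", "deploy error"]

by_cause = Counter(incidents)
print(by_cause.most_common())
# [('disk full', 3), ('network', 1), ('deploy error', 1)]
```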
Total Response Effort
This data reflects the total effort the team put into responding to incidents, including time, resources, and actions taken. Effort is not always easy to quantify precisely, but tracking it helps reveal when incident response is costing more than it should.
Example: If resolving an incident requires overtime from the team, it may indicate that the available tools are not sufficient, and it might be necessary to invest in automation.
MTTA (Mean Time to Acknowledge)
Here we evaluate the average time the team takes to acknowledge an incident after it has been reported. The lower the MTTA, the faster the resolution process begins.
Example: Imagine the system detects a server overload issue. If the MTTA is just 2 minutes, the team can act quickly to prevent a major failure.
MTTR (Mean Time to Resolve)
MTTR (Mean Time to Resolve) shows the average time needed to resolve issues. It is an essential metric for evaluating operational efficiency.
Example: If a network issue takes 4 hours to resolve, but a similar incident was fixed in just 1 hour after configuration improvements, this can serve as a basis for future optimizations.
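A minimal sketch of how MTTA and MTTR could be computed from incident timestamps; the field names and the choice to measure both from the reported time are assumptions, not the platform's actual schema:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents with reported/acknowledged/resolved timestamps
incidents = [
    {"reported": datetime(2024, 5, 1, 14, 30),
     "acknowledged": datetime(2024, 5, 1, 14, 32),
     "resolved": datetime(2024, 5, 1, 15, 30)},
    {"reported": datetime(2024, 5, 2, 9, 0),
     "acknowledged": datetime(2024, 5, 2, 9, 5),
     "resolved": datetime(2024, 5, 2, 10, 0)},
]

mtta = mean((i["acknowledged"] - i["reported"]).total_seconds() / 60
            for i in incidents)
mttr = mean((i["resolved"] - i["reported"]).total_seconds() / 60
            for i in incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```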
Last Incidents
Here, we present a detailed view of the most recent incidents. This section provides essential information about each event, such as the Incident ID, Status, description, and timestamp, making it easier to track and analyze the actions taken by the team.
Example: Imagine an issue was detected on the production server, causing system slowness. The incident was recorded at 2:30 PM and resolved by 2:45 PM. The description might be something like “Performance issue due to traffic spikes.” The status would indicate it was "Resolved." This kind of tracking allows the team to understand the impact of each incident and how long it took to resolve, helping with future decision-making to prevent similar failures.
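The example above could be represented as a record like the following; the field names are illustrative, not the platform's actual schema:

```python
from datetime import datetime

# Hypothetical incident record, mirroring the fields shown in this section
incident = {
    "id": "INC-1024",  # hypothetical Incident ID
    "status": "Resolved",
    "description": "Performance issue due to traffic spikes",
    "recorded_at": datetime(2024, 5, 1, 14, 30),
    "resolved_at": datetime(2024, 5, 1, 14, 45),
}

duration = incident["resolved_at"] - incident["recorded_at"]
print(f"Resolved in {duration}")  # 0:15:00
```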
Last Alerts
Here, we provide a detailed view of the most recent alerts recorded in the system, offering a clear analysis of conditions that require immediate attention. This allows the team to take corrective action quickly, before issues escalate into bigger problems. In this section, you’ll find essential information about each alert, such as the Alert ID, Status, description, and the time the alert occurred, making it easier to track and analyze the actions taken by the team.
Example: Suppose the system detects an alert for "Disk usage above 90%" on a server. The alert is triggered at 10:30 AM and resolved by 10:40 AM, with the team performing a cleanup of unnecessary files. The status of the alert would be "Resolved." This alert history helps the team identify which areas need more monitoring or improvements, such as optimizing disk management, helping to prevent the alert from recurring.
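As a rough sketch of the kind of condition behind such an alert, the check below compares disk usage against a threshold; the 90% threshold follows the example, and the check itself is an assumption for illustration:

```python
import shutil

DISK_USAGE_THRESHOLD = 0.90  # assumed alert threshold from the example

usage = shutil.disk_usage("/")
used_fraction = usage.used / usage.total
if used_fraction > DISK_USAGE_THRESHOLD:
    print(f"ALERT: disk usage at {used_fraction:.0%}, above 90%")
```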
Responder Incident Volume
Here, in Responder Incident Volume, we have a bar chart that illustrates the number of incidents assigned by Responder type, such as “SRE”, “No Responder”, among others. This chart provides a clear view of how the workload is distributed among team members, helping to identify potential overloads or imbalances in task distribution.
Example: Imagine that, when viewing the chart, you notice that the SRE team is being assigned to 70% of the incidents, while other groups like “No Responder” have a much lower percentage. This could indicate that the SREs are overloaded or that there’s an opportunity to redistribute the workload. By better balancing the load among responders, you can improve the team’s overall efficiency and reduce the pressure on members handling an excessive volume of incidents.
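The underlying data for this chart is essentially a count of incidents per Responder. A minimal sketch, with invented assignments that follow the example:

```python
from collections import Counter

# Hypothetical incident assignments by Responder type
assignments = ["SRE", "SRE", "No Responder", "SRE", "SRE",
               "No Responder", "SRE"]

volume = Counter(assignments)
total = sum(volume.values())
for responder, count in volume.most_common():
    print(f"{responder}: {count} incidents ({count / total:.0%})")
```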
Highest MTTA by Responder
In Highest MTTA by Responder, you can view which Responders have the highest average time to acknowledge an incident. This data is crucial for identifying potential bottlenecks in the response process and helps highlight areas where the team can improve its agility in handling incidents.
Example: Imagine that, when analyzing the chart, you notice that the SRE team has an average MTTA of 20 minutes, while other teams show significantly lower values, such as 5 minutes. This may indicate that SREs are taking longer to acknowledge and begin resolving issues, possibly due to lack of resources or task prioritization. With this information, it’s possible to investigate the cause of the delay and implement changes to reduce this time, such as improving monitoring processes or providing additional training to the team so they can respond more quickly.
Highest MTTR by Responder
In Highest MTTR by Responder, the goal is to analyze the average time each Responder takes to resolve incidents. Similar to MTTA, but focused on problem resolution, this metric is essential for identifying which team members may be facing challenges in resolving incidents efficiently.
Example: Suppose that when reviewing the chart, you find that the SRE team has an average MTTR of 2 hours, while other responders are resolving incidents in less than 30 minutes. This data may suggest that the SRE team is facing greater difficulties in problem resolution, whether due to task complexity or lack of resources. Analyzing MTTR allows you to take specific actions, such as providing additional training to the team, improving tool support, or redistributing incidents more evenly — all with the goal of optimizing response time and increasing efficiency.
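Both per-Responder rankings boil down to grouping incident timings by Responder and sorting by the mean. A minimal sketch covering MTTA and MTTR together, assuming each incident carries its responder and its acknowledge/resolve times in minutes (all values invented):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical incidents: (responder, minutes_to_acknowledge, minutes_to_resolve)
incidents = [
    ("SRE", 20, 120), ("SRE", 18, 110),
    ("Network", 5, 25), ("Network", 6, 30),
]

times = defaultdict(lambda: {"ack": [], "res": []})
for responder, ack, res in incidents:
    times[responder]["ack"].append(ack)
    times[responder]["res"].append(res)

# Rank responders by highest average MTTA
ranking = sorted(times.items(), key=lambda kv: mean(kv[1]["ack"]), reverse=True)
for responder, t in ranking:
    print(f"{responder}: MTTA {mean(t['ack']):.0f} min, "
          f"MTTR {mean(t['res']):.0f} min")
```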
Glossary of Technical Terms
Uptime: Percentage of time the system is available and operational, without interruptions. Indicates system reliability and stability.
Downtime: Period of time when the system is inactive or unavailable, directly affecting the user experience and operations.
Outages: Interruptions or failures in the system, even if brief. Measuring the frequency helps identify patterns and potential causes.
Latency: Response time of a system or application, measured in milliseconds. Indicators include maximum, minimum, and average latency.
Incident: An unexpected event that causes service interruption or degradation. It can vary in severity and operational impact.
Total Response Effort: Total effort spent by the team to handle incidents, including time, resources, and actions taken.
MTTA (Mean Time to Acknowledge): Average time the team takes to acknowledge an incident after being notified. Reflects efficiency in the initial detection of issues.
MTTR (Mean Time to Resolve): Average time required to resolve an incident. Indicates operational efficiency and the ability to respond to problems.
Alerts: Automatic notifications generated by monitoring systems when conditions requiring immediate attention are detected.
Responder: The person or team assigned to handle specific incidents or alerts, such as SRE teams.
SRE (Site Reliability Engineering): A practice that applies software engineering principles to manage systems and improve reliability and performance.