# Home Page Guide

Here are the main **operational performance indicators** of the **Elven Platform**, clearly presented to provide a comprehensive overview of what is working well and what can be improved. This **dashboard** was designed to make it easier to understand the **impact** of each **metric** and to support faster and more accurate **decision-making**.

When analyzed together, these **metrics** offer a complete view of **system performance**. Below, we explore each **indicator** with practical examples, explaining how to apply them in your daily routine in a simple and intuitive way.

It’s worth noting that all these **indicators** are available in the **Insights section**, where you’ll also find detailed **documentation** that dives deeper into each **metric**, helping you understand their **impact** on **system operations**.

{% embed url="https://demo.elven.works/demo/cmd352viq07jpvm0i126kus60" %}

## **Uptime**

**Uptime** shows the **percentage of time** the system was **available** and running **without interruptions**. The closer to **100%**, the better the **user experience**.

**Example**:\
Imagine an **e-commerce** platform during **Black Friday**. An **uptime** of **99.9%** means the site was functional almost the entire time, ensuring **sales** and **customer satisfaction**. However, even **0.1% of downtime** can result in **thousands of reais** in lost revenue.
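
The relationship between **downtime** and the **uptime percentage** can be sketched in a few lines of Python. This is an illustrative calculation, not Elven Platform code; the function name and inputs are assumptions for the example.

```python
from datetime import timedelta

def uptime_percentage(total_period: timedelta, downtime: timedelta) -> float:
    """Percentage of the period during which the system was available.

    Illustrative only: the platform computes this server-side; this sketch
    just shows the arithmetic behind the indicator.
    """
    available = total_period - downtime
    return 100.0 * available / total_period

# Roughly 86 seconds of downtime in a 24-hour day is about 99.9% uptime,
# which is why even "0.1% of downtime" matters at Black Friday scale.
day = timedelta(hours=24)
print(round(uptime_percentage(day, timedelta(seconds=86)), 1))  # 99.9
```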

## **Downtime**

This indicator, **Downtime**, shows the **total time** the system was **unavailable** during a specific period. It helps identify when **critical issues** occurred.

**Example**:\
Suppose a **customer support application** was down for **1 hour**. This could lead to **longer queues**, **customer frustration**, and a direct impact on the company’s **reputation**.

## **Outages**

Here, in **Outages**, we measure how many times the system experienced **interruptions**, even if they were brief. Knowing the **frequency of failures** is essential to understand whether there is a **recurring pattern** that needs to be addressed.

**Example**:\
A **streaming service** that experiences **10 interruptions** in a single day may be worse for the customer than one **longer outage**.

## **Latency (Max, Min, and Average per Hour)**

These indicators assess the **system’s response speed**, measuring the **maximum**, **minimum**, and **average latency** in **milliseconds**. **Low latency** means everything is running smoothly.

**Example**:\
In a **delivery app**, **high latency** can lead to delays for the customer when trying to **view the menu** or **complete an order**. If the **average latency** is around **300ms**, it might be time to **optimize performance**.
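
The per-hour **max**, **min**, and **average latency** shown on the dashboard can be reproduced from raw samples with a simple aggregation. This is a minimal sketch assuming samples are `(timestamp, latency_ms)` pairs; the function and data are hypothetical, not the platform's actual pipeline.

```python
from collections import defaultdict
from datetime import datetime

def latency_per_hour(samples):
    """Group (timestamp, latency_ms) samples by hour and return
    max/min/average latency for each hourly bucket."""
    buckets = defaultdict(list)
    for ts, ms in samples:
        # Truncate the timestamp to the top of the hour to form the bucket key.
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(ms)
    return {
        hour: {"max": max(v), "min": min(v), "avg": sum(v) / len(v)}
        for hour, v in buckets.items()
    }

samples = [
    (datetime(2024, 5, 1, 14, 5), 120),
    (datetime(2024, 5, 1, 14, 40), 300),
    (datetime(2024, 5, 1, 15, 10), 90),
]
stats = latency_per_hour(samples)
# The 14:00 bucket has max 300 ms, min 120 ms, and an average of 210 ms --
# an average near 300 ms would be the cue to optimize performance.
```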

## **Total Incidents**

This **metric** shows how many **incidents** were recorded during the analyzed period. Understanding the **volume** helps the **team** prioritize **solutions** and optimize **processes**.

**Example**:\
If the **dashboard** shows that there were **20 incidents** in the past week, it’s time to identify whether they were caused by the **same issue** or if it’s necessary to review the **infrastructure**.

## **Total Response Effort**

This **data** reflects the total **effort** the **team** spent responding to **incidents**, even though that effort is not always easy to quantify in a single number.

**Example**:\
If resolving an **incident** requires **overtime** from the team, it may indicate that the available **tools** are not sufficient, and it might be necessary to invest in **automation**.

## **MTTA (Mean Time to Acknowledge)**

Here we evaluate the **average time** the **team** takes to **acknowledge** an **incident** after it has been reported. The lower the **MTTA**, the faster the **resolution process** begins.

**Example**:\
Imagine the system detects a **server overload issue**. If the **MTTA** is just **2 minutes**, the team can act quickly to prevent a **major failure**.

## **MTTR (Mean Time to Resolve)**

**MTTR (Mean Time to Resolve)** shows the **average time** needed to **resolve issues**. It is an essential **metric** for evaluating **operational efficiency**.

**Example**:\
If a **network issue** takes **4 hours** to resolve, but a similar **incident** was fixed in just **1 hour** after **configuration improvements**, this can serve as a basis for future **optimizations**.
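
Both **MTTA** and **MTTR** are averages over incident timestamps, so they can be sketched with the same helper. The incident records below are hypothetical, and the field names (`reported`, `acknowledged`, `resolved`) are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Hypothetical incident records with the three timestamps that drive MTTA/MTTR.
incidents = [
    {"reported": datetime(2024, 5, 1, 14, 30),
     "acknowledged": datetime(2024, 5, 1, 14, 32),
     "resolved": datetime(2024, 5, 1, 14, 45)},
    {"reported": datetime(2024, 5, 2, 9, 0),
     "acknowledged": datetime(2024, 5, 2, 9, 4),
     "resolved": datetime(2024, 5, 2, 10, 0)},
]

def mean_delta(incidents, start, end):
    """Average of (end - start) across all incidents, as a timedelta."""
    total = sum((i[end] - i[start] for i in incidents), timedelta())
    return total / len(incidents)

mtta = mean_delta(incidents, "reported", "acknowledged")  # 3 minutes
mttr = mean_delta(incidents, "reported", "resolved")      # 37 minutes 30 seconds
```

The only difference between the two metrics is which pair of timestamps is averaged, which is why the dashboard can present them side by side.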

## **Last Incidents**

Here, we present a detailed view of the **most recent incidents**. This section provides essential **information** about each event, such as the **Incident ID**, **Status**, **description**, and **timestamp**, making it easier to **track** and **analyze** the actions taken by the **team**.

**Example**:\
Imagine an issue was detected on the **production server**, causing **system slowness**. The **incident** was recorded at **2:30 PM** and resolved by **2:45 PM**. The **description** might be something like “**Performance issue due to traffic spikes**.” The **status** would indicate it was "**Resolved**."\
This kind of **tracking** allows the team to understand the **impact** of each incident and how long it took to **resolve**, helping with future **decision-making** to prevent similar failures.

## **Last Alerts**

Here, we provide a detailed view of the **most recent alerts** recorded in the system, offering a clear analysis of conditions that require **immediate attention**. This allows the **team** to take **corrective actions** quickly, before they escalate into bigger problems.\
In this section, you’ll find essential **information** about each **alert**, such as the **Alert ID**, **Status**, **description**, and the **time** the alert occurred, making it easier to **track** and **analyze** the actions taken by the team.

**Example**:\
Suppose the system detects an alert for "**Disk usage above 90%**" on a server. The **alert** is triggered at **10:30 AM** and resolved by **10:40 AM**, with the team performing a **cleanup of unnecessary files**. The **status** of the alert would be "**Resolved**."\
This **alert history** helps the team identify which areas need more **monitoring** or **improvements**, such as optimizing **disk management**, helping to prevent the alert from recurring.

## **Responder Incident Volume**

Here, in **Responder Incident Volume**, we have a **bar chart** that illustrates the number of **incidents assigned** by **Responder type**, such as **“SRE”**, **“No Responder”**, among others. This chart provides a clear view of how the **workload** is distributed among **team members**, helping to identify potential **overloads** or **imbalances** in **task distribution**.

**Example**:\
Imagine that, when viewing the **chart**, you notice that the **SRE team** is being assigned to **70% of the incidents**, while other groups like **“No Responder”** have a much lower percentage. This could indicate that the **SREs are overloaded** or that there’s an opportunity to **redistribute the workload**.\
By better **balancing** the load among **responders**, you can improve the team’s overall **efficiency** and reduce the **pressure** on members handling an **excessive volume** of incidents.
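
The numbers behind the bar chart are a simple count of incidents per **Responder type**. The list below is hypothetical sample data used only to illustrate the aggregation.

```python
from collections import Counter

# Hypothetical incidents, each represented by its assigned responder type.
incident_responders = ["SRE", "SRE", "No Responder", "SRE", "DBA",
                       "SRE", "SRE", "SRE", "No Responder", "SRE"]

volume = Counter(incident_responders)          # incidents per responder type
share = {responder: 100 * count / len(incident_responders)
         for responder, count in volume.items()}
# Here "SRE" accounts for 70% of incidents -- the kind of imbalance that
# would suggest redistributing the workload.
```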

## **Highest MTTA by Responder**

In **Highest MTTA by Responder**, you can view which **Responders** have the **highest average time** to **acknowledge an incident**. This data is crucial for identifying potential **bottlenecks** in the **response process** and helps highlight areas where the team can improve its **agility** in handling incidents.

**Example**:\
Imagine that, when analyzing the **chart**, you notice that the **SRE team** has an **average MTTA** of **20 minutes**, while other teams show significantly lower values, such as **5 minutes**. This may indicate that **SREs** are taking longer to **acknowledge** and begin resolving issues, possibly due to **lack of resources** or **task prioritization**.\
With this information, it’s possible to investigate the cause of the delay and implement changes to reduce this time, such as improving **monitoring processes** or providing additional **training** to the team so they can **respond more quickly**.

## **Highest MTTR by Responder**

In **Highest MTTR by Responder**, the goal is to analyze the **average time** each **Responder** takes to **resolve incidents**. Similar to **MTTA**, but focused on **problem resolution**, this **metric** is essential for identifying which **team members** may be facing challenges in resolving incidents efficiently.

**Example**:\
Suppose that when reviewing the **chart**, you find that the **SRE team** has an **average MTTR** of **2 hours**, while other **responders** are resolving incidents in less than **30 minutes**. This data may suggest that the **SRE team** is facing greater difficulties in **problem resolution**, whether due to **task complexity** or **lack of resources**.\
Analyzing **MTTR** allows you to take specific actions, such as providing additional **training** to the team, improving **tool support**, or **redistributing incidents** more evenly — all with the goal of optimizing **response time** and increasing **efficiency**.

## **Glossary of Technical Terms**

**Uptime: Percentage of time** the system is **available** and **operational**, without interruptions. Indicates **system reliability** and **stability**.

**Downtime: Period of time** when the system is **inactive** or **unavailable**, directly affecting the **user experience** and **operations**.

**Outages: Interruptions** or **failures** in the system, even if brief. Measuring the **frequency** helps identify **patterns** and potential **causes**.

**Latency: Response time** of a system or application, measured in **milliseconds**. Indicators include **maximum**, **minimum**, and **average latency**.

**Incident:** An **unexpected event** that causes **service interruption** or **degradation**. It can vary in **severity** and **operational impact**.

**Total Response Effort:** Total **effort** spent by the **team** to handle **incidents**, including **time**, **resources**, and **actions** taken.

**MTTA (Mean Time to Acknowledge): Average time** the team takes to **acknowledge** an **incident** after being notified. Reflects **efficiency** in the **initial detection** of issues.

**MTTR (Mean Time to Resolve): Average time** required to **resolve** an **incident**. Indicates **operational efficiency** and the ability to **respond to problems**.

**Alerts: Automatic notifications** generated by **monitoring systems** when conditions requiring **immediate attention** are detected.

**Responder:** The **person** or **team** assigned to handle specific **incidents** or **alerts**, such as **SRE teams**.

**SRE (Site Reliability Engineering):** A practice that applies **software engineering principles** to manage systems and improve **reliability** and **performance**.
