# Incidents Insights Management Guide on the Elven Platform

This guide provides a clear and detailed view of how to handle **incidents** using the **metrics** and **features** available in the **Elven Platform**. Efficiently managing and responding to **incidents** requires a combination of well-structured **data**, continuous **analysis**, and adjustments to **operational processes**. By presenting these **metrics** in an integrated way, the **Elven Platform** offers the **tools** needed to improve **incident response** and increase **system stability**.

{% embed url="<https://demo.elven.works/demo/cmd35yqvo0064ze0irf3qzkl6>" %}

## **Accessing the Insights Center** in the **Incidents** section

* Navigate to the **main menu** and click on **Insights**.
* In the **submenu**, select the item **Incidents**.

## **Understanding** the **Metrics**

In the **Incidents** tab of the **Elven Platform**, we work with crucial **metrics** to optimize **incident management** and improve **team response**. The **information** is presented in an intuitive and accessible way, providing a clear and complete view of the **incidents**. The **Total Incidents** count is highlighted, giving a quick view of how many incidents impacted the systems.

For detailed **response tracking**, the **MTTA** (**Mean Time to Acknowledge**) and **MTTR** (**Mean Time to Resolve**) metrics are clearly displayed, enabling the team to monitor the time required to acknowledge and resolve incidents. Alongside these metrics, the **Average/Total Response Effort** indicates the time spent resolving incidents, offering important insights into **team efficiency**.

Additionally, the **Combo Events/MTTs/Average Response Effort** view consolidates these figures, making it easier to analyze the total impact of incidents in terms of **time** and **response effort**.

The platform also provides information on **Incidents Volume per Day**, allowing teams to identify **occurrence patterns** and prioritize actions. The **Acknowledgment Rate** and **Postmortem Rate** are key metrics to ensure quick response and continuous learning from past incidents. The distribution of incidents throughout the day is detailed through **Incidents Time Cluster**, highlighting critical impact moments such as peak or off-hours.

The analysis of **Time Cluster Distribution per Month** offers a clear view of how incidents behave over time, enabling strategic adjustments in **resources** and **monitoring**. The **Incidents Day of Week** metric helps identify days with higher incident rates, supporting **resource planning** and **mitigation strategies**. The platform also allows analysis of incident origin, with **Incidents per Origin** highlighting specific areas that require attention, such as systems or external integrations.

Finally, the visualization of **Incidents Hour Interval** provides a precise breakdown of incident behavior throughout the day, helping identify peaks and optimize **effort allocation**. With this information organized in a clear and interactive way, the platform enables agile and efficient **incident management**, ensuring a smarter and more responsive **workflow**.

Now let’s dive deeper and share some tips on each of these features that optimize **incident management** on the **Elven Platform**.

### **Total Incidents**

To understand the real impact of **incidents** on the system, it is important to monitor the total number of recorded **incidents**. This **data** helps reveal **behavior patterns** and identify **activity spikes** that may signal something unusual. Performing **periodic analyses** is a good practice, as it allows the detection of unexpected increases in **incident volume**, which may indicate anything from hidden issues to **system overload** in specific areas.

### **MTTA/MTTR**

**Mean Time to Acknowledge (MTTA)** and **Mean Time to Resolve (MTTR)** are two essential **metrics** for measuring the **efficiency** of the team in responding to **incidents**. If **MTTA** or **MTTR** is high, consider investing in **training** or making adjustments to **workflow**, such as increasing **automation** in initial responses.
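To make the two averages concrete, here is a minimal sketch of how they could be computed from exported incident data. It follows the glossary definitions in this guide (MTTA from recording to acknowledgment, MTTR from acknowledgment to resolution). The record layout and field names (`opened`, `acknowledged`, `resolved`) are assumptions for illustration, not the platform's actual export format.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names are illustrative assumptions.
incidents = [
    {"opened": datetime(2024, 5, 6, 9, 0),
     "acknowledged": datetime(2024, 5, 6, 9, 4),
     "resolved": datetime(2024, 5, 6, 9, 34)},
    {"opened": datetime(2024, 5, 6, 23, 10),
     "acknowledged": datetime(2024, 5, 6, 23, 20),
     "resolved": datetime(2024, 5, 7, 0, 20)},
    {"opened": datetime(2024, 5, 7, 14, 0),
     "acknowledged": datetime(2024, 5, 7, 14, 1),
     "resolved": datetime(2024, 5, 7, 14, 16)},
]


def mean_minutes(durations):
    """Average a series of timedelta values and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


# MTTA: time from the incident being recorded until it is acknowledged.
mtta = mean_minutes(i["acknowledged"] - i["opened"] for i in incidents)

# MTTR: time from acknowledgment until resolution (per this guide's glossary).
mttr = mean_minutes(i["resolved"] - i["acknowledged"] for i in incidents)

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

With this sample data the acknowledgment times are 4, 10, and 1 minutes, and the resolution times are 30, 60, and 15 minutes, so a single slow night-time incident visibly pulls both averages up.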

### **Average**/**Total Response Effort**

Monitoring the **Average** and **Total Response Effort** is essential to understand how much time, on average, the team takes to resolve **incidents**. This **metric** reflects not only the team's **agility** in handling incidents but also highlights potential **bottlenecks** or steps that are consuming more time than ideal.

By analyzing the time required for **resolution**, it’s possible to gain valuable **insights** into **operational efficiency**. With this **data** in hand, it becomes easier to make decisions that optimize **workflows**, improve **resource allocation**, and most importantly, ensure that **incidents** are addressed with the speed the business demands.

### **Combo Events/MTTs/Average Response Effort**

To get a complete view of how **incidents** are being handled, it’s essential to combine different **metrics**, such as **Total Incidents**, **MTTs**, and **Average Response Effort**. This combination of **data** provides a richer overview of **response efficiency** and helps understand the real impact of each **incident** on daily operations.

By cross-referencing this **information**, it’s possible to identify **patterns** in incidents that require more **time** or **effort** to resolve. This makes it easier to detect **bottlenecks**, whether in **processes**, **teams**, or specific **technologies**. With these **insights**, you can take more targeted action to optimize **workflows**, improve **tools**, and reduce **response time**, ensuring a more stable environment and a more productive team.

### **Acknowledgment Rate and Postmortem Rate**

The **Acknowledgment Rate** and **Postmortem Rate** metrics are powerful allies for evaluating how the team is handling **incidents** on a daily basis. The **Acknowledgment Rate** measures the share of **incidents** that the team acknowledges promptly, while the **Postmortem Rate** shows how many of those **incidents** resulted in documented **learnings**. Monitoring these **indicators** helps ensure not only a fast **response**, but also the team’s continuous **growth**.

Setting clear **goals** to improve these **rates** is essential. A low **acknowledgment rate** may signal **overload**, lack of **prioritization**, or even **internal communication** issues. On the other hand, a low **postmortem rate** may indicate that **incidents** are being resolved but without generating **learning**, which prevents real long-term **improvements**. The idea is to turn every **incident** into an opportunity for **evolution**.
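As a rough illustration, both rates can be expressed as simple percentages over a set of incidents. The sketch below assumes each incident carries two boolean fields, `acknowledged` and `has_postmortem`. These names are hypothetical, and the platform's exact criteria (for example, what counts as a timely acknowledgment) are not specified in this guide.

```python
def rate(incidents, flag):
    """Return the percentage of incidents whose boolean field `flag` is true."""
    if not incidents:
        return 0.0
    return 100 * sum(1 for i in incidents if i[flag]) / len(incidents)


# Hypothetical sample data; field names are assumptions for illustration.
incidents = [
    {"id": 1, "acknowledged": True, "has_postmortem": True},
    {"id": 2, "acknowledged": True, "has_postmortem": False},
    {"id": 3, "acknowledged": False, "has_postmortem": False},
    {"id": 4, "acknowledged": True, "has_postmortem": False},
]

ack_rate = rate(incidents, "acknowledged")
postmortem_rate = rate(incidents, "has_postmortem")

print(f"Acknowledgment Rate: {ack_rate:.0f}%")
print(f"Postmortem Rate: {postmortem_rate:.0f}%")
```

In this sample, three of four incidents were acknowledged but only one produced a postmortem, which is the "resolved without learning" situation described above.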

### **Incidents Hour Interval and Time Cluster Distribution per Month**

The **Incidents Hour Interval** and **Time Cluster Distribution per Month** metrics are valuable allies when it comes to understanding when incidents occur most frequently, whether during specific hours of the day or particular periods of the month. Having this clear **temporal view** is essential for anticipating risks, better planning team operations, and ensuring a faster and more strategic response during high-impact moments.

To support this analysis, time has been segmented into three well-defined ranges:

* **Sleep Hour** (nighttime):\
  Every day of the week, including weekends and holidays, from **10:00 p.m. to 8:00 a.m.**
* **Business Hour** (working hours):\
  Monday to Friday, from **8:00 a.m. to 6:00 p.m.**
* **Off Hour** (outside working hours):
  * Monday to Friday, from **6:00 p.m. to 10:00 p.m.**
  * Weekends, from **8:00 a.m. to 10:00 p.m.**

With this classification, it's much easier to identify patterns and act proactively, ensuring that the right **resources** are available at the right time.

If there are **incident peaks** at recurring hours, it is worth investigating whether these windows coincide with higher **operation volumes**, frequent **deployments**, or **elevated system loads**. With this data in hand, it's possible to assess whether the **infrastructure** needs reinforcement, whether **processes** can be optimized, or whether **maintenance** or **support** hours should be rescheduled. Small adjustments in this direction can make a significant difference in the **stability of the environment**.
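The three time ranges defined in this section can be sketched as a small classification function. This is only an illustration of the rules as written here, not the platform's implementation. It assumes each range includes its start time and excludes its end time (so 6:00 p.m. counts as **Off Hour**), and it does not consult a holiday calendar, so a daytime incident on a holiday is classified by the weekday it falls on.

```python
from datetime import datetime


def time_cluster(ts: datetime) -> str:
    """Classify a timestamp into Sleep Hour, Business Hour, or Off Hour."""
    hour = ts.hour
    # Sleep Hour: every day, 10:00 p.m. to 8:00 a.m.
    if hour >= 22 or hour < 8:
        return "Sleep Hour"
    # Business Hour: Monday (weekday 0) to Friday (weekday 4), 8:00 a.m. to 6:00 p.m.
    if ts.weekday() < 5 and hour < 18:
        return "Business Hour"
    # Off Hour: weekday evenings (6:00 p.m. to 10:00 p.m.) and weekend daytime.
    return "Off Hour"


# 2024-05-06 is a Monday and 2024-05-11 is a Saturday.
samples = [
    datetime(2024, 5, 6, 9, 30),
    datetime(2024, 5, 6, 19, 0),
    datetime(2024, 5, 6, 23, 15),
    datetime(2024, 5, 11, 10, 0),
]
for ts in samples:
    print(ts.isoformat(), "->", time_cluster(ts))
```

Applying a function like this to exported incident timestamps makes it possible to reproduce the cluster counts outside the platform, for example to compare them against deployment or maintenance schedules.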

### **Incidents Time Cluster and Incidents Day of Week**

The analysis of **incidents** by **Time of Day** and **Day of the Week** provides a strategic view of how these events are distributed over time. This perspective allows for clearer identification of **recurring patterns**, such as more **critical hours** or **days**, helping the team anticipate potential **risks**. As a result, it becomes possible to reinforce **operations** at the right moments, improve **response capacity**, and apply more effective **preventive actions**.

To make this reading even more intuitive, we use the same **time segmentation** (**Sleep Hour**, **Business Hour**, and **Off Hour**) defined in the previous section, ensuring **consistency** in the analysis and facilitating **data-driven decision-making**.

For example, if there is an increase in **incidents** on **Monday**, this may indicate a natural **overload** after the weekend, whether due to **task accumulation**, **service restarts**, or increased **system usage**. In this case, it may be worth considering **team reinforcement** or a **process review** on that day. The goal is to **anticipate problems**, ensuring the right **resources** are available when they are most needed.
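The Monday example can be checked with a quick tally by weekday. The sketch below is a minimal illustration over hypothetical incident dates, and the data is made up purely to show the shape of the analysis.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical dates on which incidents were recorded.
incident_dates = [
    date(2024, 5, 6),   # Monday
    date(2024, 5, 6),   # Monday
    date(2024, 5, 7),   # Tuesday
    date(2024, 5, 13),  # Monday
    date(2024, 5, 17),  # Friday
]

# date.weekday() returns 0 for Monday through 6 for Sunday.
by_weekday = Counter(WEEKDAYS[d.weekday()] for d in incident_dates)
busiest_day, busiest_count = by_weekday.most_common(1)[0]

print(f"Busiest day: {busiest_day} ({busiest_count} incidents)")
```

A tally over only a few weeks can be misleading, so it is worth confirming a pattern over a longer period before changing staffing or processes.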

### **Incidents per Origin**

Understanding the **origin** of incidents is essential to identify where problems are truly coming from. This visibility allows teams to map failures in specific systems, such as **APIs**, **integrations with external platforms**, or **critical parts of the internal infrastructure**. By clearly knowing the origin, teams can act with greater **precision** and **agility**, focusing on what truly needs to be fixed.

If a specific origin, such as **API-Auth** or **API-Report**, is consistently related to incidents, this is a clear sign that the area needs **attention**. In such cases, it is possible to concentrate efforts on:

* Improving **code quality**,
* Increasing **automated testing**,
* Adjusting **integration processes**,
* Rethinking the **architecture** of the solution.

By doing so, not only are failures reduced, but **confidence** in the systems and in the teams that maintain them is also strengthened.
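One simple way to spot an origin that is "consistently related to incidents" is to compute each origin's share of the total and flag those above a chosen cutoff. The sketch below assumes a plain list of origin labels. The 40% threshold and the `Database` origin are arbitrary assumptions for illustration, not platform defaults.

```python
from collections import Counter

# Hypothetical origin label for each recorded incident.
origins = ["API-Auth", "API-Report", "API-Auth",
           "Database", "API-Auth", "API-Report"]

counts = Counter(origins)
total = sum(counts.values())

# Assumed cutoff: flag any origin responsible for at least 40% of incidents.
THRESHOLD = 0.4
hotspots = [origin for origin, n in counts.most_common()
            if n / total >= THRESHOLD]

for origin, n in counts.most_common():
    print(f"{origin}: {n} incidents ({100 * n / total:.0f}%)")
print("Needs attention:", hotspots)
```

The right threshold depends on how many origins exist and how evenly incidents are normally spread, so treat the cutoff as a starting point to tune rather than a fixed rule.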

## **Glossary of Technical Terms**

**Incidents**: Events that have a real **impact**, such as a **failure** or **interruption**. Continuously tracking **incidents** is essential to prevent larger issues and ensure a **quick and effective response** from the team.

**Insights Center**: Core module of the **Elven Platform** that provides in-depth analysis of **operational** and **business data**, supporting **strategic decision-making** and **performance improvement**.

**Total Incidents**: **Metric** that indicates the total number of **incidents** recorded over a period, providing a view of the **magnitude** of impactful events.

**Incidents Volume per Day**: **Metric** that shows the daily distribution of **incidents**, allowing identification of **peaks** and **patterns** over time.

**Average/Total Response Effort**: **Indicator** that measures the time spent resolving **incidents**, helping assess **team efficiency**.

**MTTA**: **Average time** the team takes to acknowledge an **incident** after it is recorded.

**MTTR**: **Average time** required to resolve an **incident** after it has been acknowledged.

**Acknowledgment Rate**: **Percentage** of **incidents** that were quickly acknowledged, indicating the team's **effectiveness** in initial response.

**Postmortem Rate**: **Percentage** of **incidents** that underwent an **analysis**, aimed at **learning** and **preventing recurrence**.

**Incidents Time Cluster**: **Grouping** of **incidents** based on the time they occur, allowing identification of **critical impact periods**.

**Time Cluster Distribution per Month**: **Metric** that organizes **incidents** across months, helping identify **seasonal trends**.

**Incidents Day of Week**: **Metric** that distributes **incidents** by **day of the week**, enabling **resource planning** and **strategic adjustments**.

**Incidents per Origin**: **Classification** of **incidents** based on their **source**, such as **internal systems**, **APIs**, or **external integrations**.

**Incidents Hour Interval**: **Distribution** of **incidents** across **time intervals** throughout the day, allowing identification of **activity peaks**.

**Combo Events/MTTs/Average Response Effort**: **Consolidated view** that combines **incident counts**, mean times (**MTTA** and **MTTR**), and **Average Response Effort**, providing a unified analysis of **incident impact**.
