Management Guide for the Resource Center of the Elven Platform

The Resource Center of the Elven Platform gives you a clear view of your organization’s resources. Here you can monitor operational status, essential SRE metrics, and event history in real time. Its interface is designed to be intuitive and easy to navigate, turning data into insights that help your team make decisions quickly and confidently.

To add new resources, go to the Services Hub, or click the + Resource button in the Resource Center to be redirected there.

Accessing the Resource Center

  • Navigate to the main menu and click on Monitoring.

  • In the submenu, select the Resources item.

Working with the Resource Center

The homepage of the Resource Center presents your resources as an organized list, with columns showing the name and current status of each resource. A search bar lets you find resources by name, and you can filter the list by status (All, Inactive, Operational, Pending, In Maintenance, and Not Operational) to quickly find exactly what you need.

By clicking on a resource, you’ll be taken to a resource page designed to provide all the necessary information in a clear and accessible way. In the General Information section, you can view the current status of the resource with color-coded visual indicators for easier identification. Additionally, you can manage monitoring with a simple toggle button to enable or disable monitoring for the resource.

In the Reliability Metrics section, you’ll find key data such as Mean Time Between Failures (MTBF), Mean Time to Acknowledge (MTTA), and Mean Time to Recovery (MTTR), along with an uptime history displayed in interactive charts showing periods of 1 hour, 6 hours, 24 hours, and up to 365 days.
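As a rough illustration of how these averages are commonly derived, the sketch below computes MTTA, MTTR, and MTBF from a list of incident timestamps. The `Incident` record, its field names, and the sample data are hypothetical, and the formulas follow one common convention (MTBF as operating time divided by the number of failures). The Elven Platform may calculate its figures differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; these field names are assumptions,
    # not the Elven Platform's data model.
    started: datetime       # when the failure began
    acknowledged: datetime  # when a responder acknowledged it
    resolved: datetime      # when the resource recovered


def mean(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time from failure start to acknowledgement."""
    return mean([i.acknowledged - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from failure start to recovery."""
    return mean([i.resolved - i.started for i in incidents])


def mtbf(incidents: list[Incident], window_start: datetime,
         window_end: datetime) -> timedelta:
    """Operating (up) time in the window divided by the number of failures."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return (window_end - window_start - downtime) / len(incidents)


# Sample data: two incidents inside a 24-hour window.
day = datetime(2024, 1, 1)
incidents = [
    Incident(day.replace(hour=2), day.replace(hour=2, minute=5),
             day.replace(hour=2, minute=30)),
    Incident(day.replace(hour=14), day.replace(hour=14, minute=15),
             day.replace(hour=15, minute=30)),
]

print(mtta(incidents))                             # 0:10:00
print(mttr(incidents))                             # 1:00:00
print(mtbf(incidents, day, day + timedelta(1)))    # 11:00:00
```

In this sample, two hours of downtime in a 24-hour window leave 22 hours of operating time, so the MTBF across two failures is 11 hours.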

In the Resource Center of the Elven Platform, you also have access to detailed Percentile Latency Charts, which display median response times (p50) and higher-latency cases (p90 and p95). These charts use colored lines to highlight performance peaks and make patterns easier to spot. Additionally, real-time latency charts let you monitor system performance as it happens.

These metrics are essential for understanding system behavior: spikes may indicate periods of higher load or intensive processing, while lower values suggest smooth operation. This continuous monitoring helps your SRE team quickly identify bottlenecks or anomalies and act proactively to prevent failures or performance degradation.
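To make the percentile values concrete, here is a minimal sketch that computes p50, p90, and p95 from a set of response times using the nearest-rank method. The `percentile` helper and the sample latencies are illustrative assumptions; the platform's charts may use a different interpolation method. Note that p50 is the median, which can differ noticeably from the arithmetic mean when a few requests are very slow.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, with two slow outliers.
latencies_ms = [100, 110, 120, 125, 130, 135, 140, 150, 400, 900]

print(percentile(latencies_ms, 50))            # 130
print(percentile(latencies_ms, 90))            # 400
print(percentile(latencies_ms, 95))            # 900
print(sum(latencies_ms) / len(latencies_ms))   # 231.0
```

Here the two outliers pull the mean up to 231 ms while p50 stays at 130 ms, which is why the higher percentiles are the ones to watch for spikes.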

Finally, in the Event History section, you can search for incidents by name and refine the results with advanced filters by status, severity, origin, and time period. Each event shows its current status, description, date and time, and a link to the incident page, where you will find detailed technical information.

Glossary of Technical Terms

Resource Center: A centralized area in the Elven Platform for managing and monitoring the organization’s resources. It shows operational status, essential SRE metrics, and event history in real time through an intuitive interface.

SRE (Site Reliability Engineering): An engineering practice focused on improving the reliability, performance, and scalability of systems. The Resource Center provides the metrics that SRE teams rely on for continuous monitoring and incident resolution.

Services Hub: A module that centralizes the available services for integration, allowing you to add new resources in a simple and organized way.

Operational Status: An indicator that shows the current condition of a resource, for example operational, experiencing failures, or under maintenance.

Filters: A feature that allows you to refine resource searches based on criteria such as status.

  • All: Displays all entries, regardless of current status.

  • Inactive: The resource is configured but not currently active.

  • Operational: The resource is functioning normally, with no detected issues.

  • Pending: The resource is in the process of being activated or awaiting action before going live.

  • In Maintenance: The resource is undergoing planned maintenance and may be temporarily unavailable.

  • Not Operational: The resource is down or experiencing failures, impacting its functionality.
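The combination of name search and status filter described above can be modeled with a short sketch. Everything here is hypothetical: the `Resource` record, the `filter_resources` helper, and the case-insensitive substring match are assumptions made for illustration, not the platform's actual search behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    # The status values listed in the glossary; "All" is modeled as no filter.
    INACTIVE = "Inactive"
    OPERATIONAL = "Operational"
    PENDING = "Pending"
    IN_MAINTENANCE = "In Maintenance"
    NOT_OPERATIONAL = "Not Operational"


@dataclass
class Resource:
    name: str
    status: Status


def filter_resources(resources: list[Resource], query: str = "",
                     status: Optional[Status] = None) -> list[Resource]:
    """Keep resources whose name contains the query (case-insensitive)
    and whose status matches; status=None plays the role of "All"."""
    needle = query.strip().lower()
    return [
        r for r in resources
        if needle in r.name.lower() and (status is None or r.status is status)
    ]


# Hypothetical resources.
resources = [
    Resource("checkout-api", Status.OPERATIONAL),
    Resource("billing-api", Status.NOT_OPERATIONAL),
    Resource("auth-db", Status.IN_MAINTENANCE),
]

down = filter_resources(resources, query="API", status=Status.NOT_OPERATIONAL)
print([r.name for r in down])  # ['billing-api']
```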

Incidents: Critical events that affect system operations and require corrective action.

Reliability Metrics: A set of metrics related to system reliability, including:

  • MTBF (Mean Time Between Failures): The average time between failures, measuring system reliability.

  • MTTA (Mean Time To Acknowledge): The average time to acknowledge an incident.

  • MTTR (Mean Time To Recovery): The average time to recover a resource after a failure.

  • Uptime: The percentage of time the system has remained operational, visualized through interactive charts.
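As an example of the uptime calculation, the sketch below derives an uptime percentage for a time window from a list of outage intervals. The `uptime_percent` function and the sample outages are hypothetical, and the sketch assumes the outages do not overlap one another; the platform's own calculation may differ.

```python
from datetime import datetime


def uptime_percent(outages: list[tuple[datetime, datetime]],
                   window_start: datetime, window_end: datetime) -> float:
    """Percentage of the window during which the resource was up.

    Assumes the (start, end) outage intervals do not overlap each other.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - down) / total


# Hypothetical 24-hour window with 36 minutes of downtime inside it.
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 2)
outages = [
    (datetime(2023, 12, 31, 23, 54), datetime(2024, 1, 1, 0, 6)),  # 6 min in window
    (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 30)),     # 30 min
]

print(uptime_percent(outages, window_start, window_end))  # 97.5
```

Clipping each outage to the window matters for the shorter chart periods (1 hour or 6 hours), where an outage can begin before the period shown.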

Interactive Uptime Charts: Charts that display a resource’s uptime over different periods (1 hour, 6 hours, 24 hours, and up to 365 days). These charts help analyze the continuity of operations over time.

Percentile Latency Charts: Charts that show a resource’s response times, divided into percentiles, including:

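  • p50 (50th percentile): The median response time. Half of all requests complete at or below this value.

  • p90 (90th percentile) and p95 (95th percentile): The response times that 90% and 95% of requests stay at or below. These values capture the slower, higher-latency cases.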

Real-Time Latency Charts: Visualizations that display a resource’s latency performance in real time, allowing you to monitor response times as they occur.

Latency Analysis: The process of analyzing response times to identify system behavior patterns. Latency spikes may indicate overload or issues, while lower values suggest stable performance.

Event History: A section where all events and incidents are listed, with details such as description, current status, date, and time. Advanced filters can be applied to search for events by status, severity, origin, and time period.

Advanced Filters: A feature that allows you to refine searches for events and resources based on criteria such as status, severity, origin (e.g., affected system or service), and time period (e.g., last month, last week).

Incident Severity: A classification of incidents based on their impact on the system. Severity helps prioritize the response to more critical events.

Incident Origin: The location or service where an incident started. The underlying cause could be a system failure, human error, or an issue with an external service.
