---
layout: markdown_page
title: "Monitor workflow - Triage"
---

- TOC
{:toc}

# Triage

This page describes the vision for GitLab's triage workflow as part of our [Monitor](https://about.gitlab.com/handbook/engineering/development/ops/monitor/) stage.

## Why Triage?

Triage is the process of detecting and identifying application performance bottlenecks, with the goal of understanding the root cause of a problem quickly and accurately.
To troubleshoot quickly and effectively, you need to have collected, and have access to, all the information relevant to diagnosing the degradation. In this process, we aim to provide meaningful insights through deep visibility into all segments of an application.
This way, when your application breaks, we'll help you quickly figure out why.

## User Journey
### Starting point

The triage flow usually starts with an alert or a customer complaint. GitLab will immediately alert you about your application's health
via email, Slack, PagerDuty, or any other third-party tool.

Once an alert has been triggered, it is examined, and a verification process begins to determine whether this is a real problem and whether it is still ongoing. It is also recommended
to check known issues. Afterward, the alert is acknowledged and assigned to the right team for further investigation.
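The steps above can be read as a small state machine. The following is a minimal sketch, not a description of how GitLab implements alerts: the state names, the `dismissed` state, and the rule that an alert which is not real or no longer ongoing gets dismissed are all assumptions made for illustration.

```python
from enum import Enum


class AlertState(Enum):
    """Hypothetical alert states; the names are illustrative only."""
    TRIGGERED = "triggered"
    VERIFIED = "verified"
    ACKNOWLEDGED = "acknowledged"
    ASSIGNED = "assigned"
    DISMISSED = "dismissed"


# Allowed transitions between states (an assumption, not a GitLab API).
TRANSITIONS = {
    AlertState.TRIGGERED: {AlertState.VERIFIED, AlertState.DISMISSED},
    AlertState.VERIFIED: {AlertState.ACKNOWLEDGED},
    AlertState.ACKNOWLEDGED: {AlertState.ASSIGNED},
    AlertState.ASSIGNED: set(),
    AlertState.DISMISSED: set(),
}


def advance(current: AlertState, target: AlertState) -> AlertState:
    """Return the target state if the transition is allowed, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def verify(is_real: bool, is_ongoing: bool) -> AlertState:
    """Verification step: only a real, ongoing problem moves forward."""
    target = AlertState.VERIFIED if (is_real and is_ongoing) else AlertState.DISMISSED
    return advance(AlertState.TRIGGERED, target)
```

Keeping the transitions in one table makes it easy to see that an alert cannot be assigned to a team before it has been verified and acknowledged.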

### Understand the business impact

An essential part of the triage flow is understanding the business impact of a problem. Is it a system-wide failure? Does it affect all users or just a subset?
The business impact dictates the level of urgency and the course of action to take, for example, which runbooks to follow or which team members to collaborate with.
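As a rough illustration of how business impact could drive urgency and runbook selection, here is a minimal sketch. The thresholds, urgency labels, and runbook paths are hypothetical and would need to be replaced with values that fit your organization.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    system_wide: bool           # is this a system-wide failure?
    affected_user_ratio: float  # share of users affected, from 0.0 to 1.0


def urgency(impact: Impact) -> str:
    """Map business impact to an urgency level (thresholds are assumptions)."""
    if impact.system_wide or impact.affected_user_ratio >= 0.5:
        return "critical"
    if impact.affected_user_ratio >= 0.05:
        return "high"
    return "low"


# Hypothetical runbook locations, keyed by urgency level.
RUNBOOKS = {
    "critical": "runbooks/system-wide-outage.md",
    "high": "runbooks/partial-degradation.md",
    "low": "runbooks/minor-issue.md",
}


def select_runbook(impact: Impact) -> str:
    """Pick the runbook that matches the urgency of the given impact."""
    return RUNBOOKS[urgency(impact)]
```

For example, `select_runbook(Impact(system_wide=False, affected_user_ratio=0.1))` would point to the partial-degradation runbook under these assumed thresholds.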

### Collaboration

Collaboration is critical to successfully troubleshooting an incident. Oftentimes, different teams need to work together to
detect the root cause of the problem and reduce the MTTR (Mean Time To Resolution). As a result, actions such as involving other teams, communicating internally, and notifying stakeholders all need to take place.

### Ad-hoc investigation

To investigate a problem effectively, an observability solution is expected to provide:

* A bird's-eye-view visualization of your application
* The ability to sort and filter
* The ability to create ad-hoc dashboards
* A quick way to dive into every component in the system
* A visualization of how services interact, to help find bottlenecks in your application
* Easy pivoting between traces, metrics, and logs while preserving the right context

Once the investigation is over, it is common to document the findings for future analysis.

## What's next

We plan to provide a streamlined triage experience that allows our users to quickly identify and effectively troubleshoot an application problem, as described in the following flow:

``` mermaid
graph TB;
A[Alerts] -->|Embedded Metric Chart in Incident|B
B[Metrics] -->|Timespan Log Drilldown|C
C[Logs] -->|TraceID Search|D[Traces]
```
 
Detailed information can be found in the [triage to minimal epic](https://gitlab.com/groups/gitlab-org/-/epics/2225).
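The drill-down steps in the flow above (from a timespan to logs, and from logs to traces by trace ID) can be sketched as follows. This is an illustrative sketch under stated assumptions, not GitLab's implementation: the `LogLine` type, the five-minute window, and all function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class LogLine:
    """Hypothetical log record; real log schemas will differ."""
    timestamp: datetime
    message: str
    trace_id: Optional[str] = None


def log_window(alert_time: datetime,
               margin: timedelta = timedelta(minutes=5)) -> Tuple[datetime, datetime]:
    """Timespan around the alert to drill into (the margin is an assumption)."""
    return alert_time - margin, alert_time + margin


def drill_down(logs: List[LogLine], start: datetime, end: datetime) -> List[LogLine]:
    """Keep only the log lines that fall inside the timespan."""
    return [line for line in logs if start <= line.timestamp <= end]


def trace_ids(logs: List[LogLine]) -> List[str]:
    """Collect unique trace IDs, in order of first appearance, to search traces by."""
    seen: List[str] = []
    for line in logs:
        if line.trace_id and line.trace_id not in seen:
            seen.append(line.trace_id)
    return seen
```

Each step narrows the search while carrying context forward: the alert time defines the log timespan, and the trace IDs found in those logs become the search keys for traces.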