---
layout: markdown_page
title: "Category Direction - Alert Management"
---
- TOC
{:toc}
## Introduction and how you can help
Thanks for visiting this category page on Alert Management in GitLab. This page belongs to the Health group of the Monitor stage, and is maintained by [Sarah Waldner](https://gitlab.com/sarahwaldner) who can be contacted directly via [email](mailto:swaldner@gitlab.com). This vision is a work in progress and everyone can contribute. Sharing your feedback directly on issues and epics at GitLab.com is the best way to contribute to our vision. If you’re a GitLab user and have direct knowledge of your need for Alert Management, we’d especially love to hear from you.
## Overview
The cost of IT service disruptions continues to increase as every company becomes a tech company. Services that were previously offered during "business hours only" now run 24/7 and are expected to adhere to [six nines of uptime](https://en.wikipedia.org/wiki/High_availability#%22Nines%22). Moreover, operating these services becomes increasingly complex in this age of digital transformation. New technologies emerge in the market daily, software development teams are moving to [CI/CD frameworks](https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/), and [legacy platforms are evolving into globally distributed networks of microservices](https://www.gartner.com/smarterwithgartner/4-steps-to-design-microservices-for-agile-architecture/). It is critical for [modern operations teams](https://www.gartner.com/smarterwithgartner/5-steps-to-build-agile-infrastructure-operations/) to implement an accurate and flexible IT alerting system that enables them to detect problems and solve them proactively.
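To make that availability target concrete, here is the arithmetic behind six nines. At 99.9999% availability, the allowed downtime is a tiny fraction of the year:

```latex
% Allowed annual downtime at 99.9999% ("six nines") availability
\begin{aligned}
\text{downtime/year} &= (1 - 0.999999) \times 365 \times 24 \times 3600\ \text{s} \\
                     &= 10^{-6} \times 31{,}536{,}000\ \text{s} \\
                     &\approx 31.5\ \text{seconds}
\end{aligned}
```

In other words, a service held to six nines can afford barely half a minute of total downtime per year, which leaves no room for slow, manual alert triage.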
Teams responsible for maintaining available and reliable services require a stack of tools to monitor the different layers of technology that comprise software services. These tools capture events (changes in the state of an IT environment) and generate alerts for critical events that indicate a degradation in application or system behavior.

The complexity of IT applications, systems, and architectures, and the many tools required to monitor them, causes multiple problems for operators with regard to alerting. First, it is very challenging to figure out the correct metrics to monitor and the right thresholds to alert on. Most teams end up defining alerts too broadly for fear of missing critical issues. This results in a constant barrage of alert notifications, a problem that is further exacerbated when multiple tools alert concurrently. When this happens, teams are forced to react to problems rather than proactively mitigate them, because they can't keep up with the stream of alerts and are constantly switching between tools and interfaces. This causes 'alert fatigue' and leads to high stress and low morale. What these teams need is a single central interface that aggregates alerts from any source or multiple sources. The alert system should provide automatic deduplication and event correlation, enabling operators to efficiently triage and prioritize problems for resolution.
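As an illustration of the kind of deduplication such a system performs, here is a minimal sketch in Python. This is not GitLab's implementation; the field names (`monitoring_tool`, `title`, `labels`) are assumptions chosen for the example:

```python
import hashlib

def fingerprint(alert: dict) -> str:
    """Derive an identity for an alert: same tool + title + labels => same problem."""
    key = "|".join([
        alert.get("monitoring_tool", ""),
        alert.get("title", ""),
        str(sorted(alert.get("labels", {}).items())),
    ])
    return hashlib.sha256(key.encode()).hexdigest()

def deduplicate(alerts: list) -> dict:
    """Collapse repeated firings of the same alert into one entry with a count."""
    merged = {}
    for alert in alerts:
        fp = fingerprint(alert)
        if fp in merged:
            merged[fp]["events"] += 1            # repeated firing: bump the counter
        else:
            merged[fp] = {**alert, "events": 1}  # first occurrence of this problem
    return merged

# Two identical Prometheus firings and one Nagios alert collapse to two entries.
incoming = [
    {"monitoring_tool": "prometheus", "title": "High CPU on web-01"},
    {"monitoring_tool": "prometheus", "title": "High CPU on web-01"},
    {"monitoring_tool": "nagios", "title": "Disk full on db-02"},
]
print(len(deduplicate(incoming)))  # => 2
```

Repeated firings with the same fingerprint collapse into one entry with an event count, so responders see one problem instead of a stream of identical notifications.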
### Mission
Our mission is to close the gap between outage detection and service restoration for DevOps teams by consolidating alerts in the same application where they investigate metrics, logs, traces, and errors and resolve incidents.
### Challenges
As we invest R&D in adding Alert Management to GitLab, we are faced with the following challenges:
* Well-entrenched market leaders want to own [alerting](https://www.datadoghq.com/alerts/) and position their platforms as the single (and only) place where their customers manage alerts. They are incentivized to make alert fatigue a thing of the past.
* Customers are not going to purchase GitLab for Alert Management alone, because its value depends on many other GitLab features.
### Opportunities
We are uniquely positioned to take advantage of the following opportunities:
* Alert consolidation is table stakes for incident management platforms. Beyond being the place where users respond to and manage alerts, GitLab is also where they can investigate and remediate outages.
* We are well-practiced in building [boring solutions](https://about.gitlab.com/handbook/values/#boring-solutions) and in [iteration](https://about.gitlab.com/handbook/values/#iteration). This will enable us to quickly add an interface to GitLab that aggregates alerts and immediately eliminates the need to triage across multiple tools.
* We can dominate the incident response market for [cloud-native applications](/handbook/product/application-types/#cloud-native-web-applications), where incumbent players (such as [ServiceNow](https://www.servicenow.com/)) have been slow to meet user requirements for an integrated, robust understanding of the health of a complex microservices-based application.
## Target Audience and Experience
As Alert Management matures through the minimal and viable levels, we are creating an intuitive and streamlined experience for [Allison (Application Ops)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#allison-application-ops) and [Devon (DevOps Engineer)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#devon-devops-engineer).
Initially, this experience will be oriented toward DevOps teams at smaller companies, where it is common for engineers to be on call and to respond to alerts for the same software they write.
## Strategy
### Maturity Plan
We are currently focused on moving **Alert Management** from the `planned` to the `minimal` maturity level and that work is captured in this [epic](https://gitlab.com/groups/gitlab-org/-/epics/2877). Definitions of these maturity levels can be found on [GitLab's Maturity page](https://about.gitlab.com/direction/maturity/).
### What is Next & Why?
Processing alerts during a fire-fight requires responders to coordinate across multiple tools to evaluate different data sources. This is time-consuming because every time a responder switches to a new tool, they are confronted with a new interface and different interactions, which is disorienting and slows down triage workflows.
The `minimal` version of Alert Management will be an interface in GitLab that aggregates alerts from any tool. Just as your application runs on a stack of technology, a stack of monitoring tools ensures that each layer of that technology is reliable and available. There are hundreds of such tools on the market, and we want to consume alerts from all of them. You can follow our progress and contribute to the MVC via this [epic](https://gitlab.com/groups/gitlab-org/-/epics/2877).
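Because alerts arrive over plain HTTP, any tool that can send a webhook can feed such an interface. The sketch below shows what posting an alert to an aggregation endpoint might look like; the URL, token, and payload fields here are illustrative assumptions, not a finalized GitLab API:

```python
import requests

# Hypothetical endpoint and token -- illustrative only, not a finalized GitLab API.
ENDPOINT = "https://gitlab.example.com/acme/shop/alerts/notify.json"
TOKEN = "<integration-authorization-token>"

# A tool-agnostic alert payload: any monitoring tool that can send an HTTP POST
# could be adapted to emit this shape.
alert = {
    "title": "Checkout latency above 2s (p95)",
    "description": "p95 latency for /checkout has exceeded 2s for 5 minutes.",
    "monitoring_tool": "prometheus",
    "severity": "critical",
    "start_time": "2020-04-21T10:48:40Z",
    "hosts": ["web-03.internal"],
}

response = requests.post(
    ENDPOINT,
    json=alert,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=5,
)
response.raise_for_status()
```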
Once we've created an interface where you can view alerts from different tools side by side, we will enrich that experience by enabling you to interact with them and take action on them.
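One way to picture "taking action" on an alert is as a small status lifecycle that a responder drives forward. The statuses and transitions in this sketch are an illustrative assumption, not a committed design:

```python
from enum import Enum

class AlertStatus(Enum):
    TRIGGERED = "triggered"          # the alert has fired and needs attention
    ACKNOWLEDGED = "acknowledged"    # a responder has claimed it
    RESOLVED = "resolved"            # the underlying problem is fixed

# Allowed transitions; resolved is terminal in this sketch.
TRANSITIONS = {
    AlertStatus.TRIGGERED: {AlertStatus.ACKNOWLEDGED, AlertStatus.RESOLVED},
    AlertStatus.ACKNOWLEDGED: {AlertStatus.RESOLVED},
    AlertStatus.RESOLVED: set(),
}

def transition(current: AlertStatus, target: AlertStatus) -> AlertStatus:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target

status = transition(AlertStatus.TRIGGERED, AlertStatus.ACKNOWLEDGED)
status = transition(status, AlertStatus.RESOLVED)
```

Restricting transitions this way keeps ownership clear: once a responder has claimed an alert, it moves forward toward resolution rather than silently returning to the queue.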
### What is not planned right now
This is a new category and we are still refining our vision. We will add items to this section as we move through research and prioritization.
## Competitive Landscape
* [Splunk Alert Manager](http://docs.alertmanager.info/en/latest/)
* [Moogsoft Alerting](https://docs.moogsoft.com/AIOps.7.0.0/Alerts_Overview.html)
* [OpsGenie](https://docs.opsgenie.com/docs/alerts-page)
* [BigPanda](https://docs.bigpanda.io/docs/reference-incidents-tab#section-alerts-tab)
## Analyst Landscape
Not yet, but accepting merge requests to this document.
## Top Customer Success/Sales Issue(s)
Not yet, but accepting merge requests to this document.
## Top Customer Issue(s)
Not yet, but accepting merge requests to this document.
## Top Internal Customer Issue(s)
Not yet, but accepting merge requests to this document.
## Top Vision Item(s)
Not yet, but accepting merge requests to this document.