---
layout: markdown_page
title: "Category Direction - Incident Management"
---

- TOC
{:toc}

## Introduction and how you can help

Thanks for visiting this category page on Incident Management in GitLab. This page belongs to the Health group of the Monitor stage, and is maintained by [Sarah Waldner](https://gitlab.com/sarahwaldner) who can be contacted directly via [email](mailto:swaldner@gitlab.com).

This vision is a work in progress and everyone can contribute. Sharing your feedback directly on issues and epics at GitLab.com is the best way to contribute to our vision. If you’re a GitLab user and have direct knowledge of your need for incident management, we’d especially love to hear from you.

## Overview

Downtime costs companies an average of $5,600/minute, [according to Gartner](https://blogs.gartner.com/andrew-lerner/2014/07/16/the-cost-of-downtime/). This number, though an estimate based on a wide range of companies, communicates that downtime is expensive for organizations. This is especially true for those that have not invested in cultivating process and culture around managing these outages and resolving them quickly. The larger an organization becomes, the more distributed its systems and teams tend to be. This distribution leads to longer response times and more money lost for the business. Investing in the right tools and fostering a culture of autonomy, feedback, quality, and automation leads to more time spent innovating and building software and less time spent reacting to outages and racing to restore services. The tools your DevOps teams use to respond during incidents critically affect [MTTR (Mean Time To Resolve, also known as Mean Time To Repair)](https://en.wikipedia.org/wiki/Mean_time_to_repair) as well as the happiness and morale of team members responsible for the IT services your business depends on.
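
To make the two numbers above concrete, here is a minimal Python sketch of how MTTR and an estimated downtime cost could be calculated. The incident timestamps are hypothetical, and treating every minute between detection and resolution as downtime billed at Gartner's average rate is a simplifying assumption, not a GitLab feature or a measured result.

```python
from datetime import datetime, timedelta

# Average cost of downtime cited above (Gartner), in US dollars per minute.
COST_PER_MINUTE_USD = 5600

# Hypothetical incidents as (detected, resolved) timestamp pairs.
incidents = [
    (datetime(2020, 1, 6, 9, 0), datetime(2020, 1, 6, 9, 45)),      # 45 minutes
    (datetime(2020, 1, 14, 22, 10), datetime(2020, 1, 14, 22, 40)),  # 30 minutes
    (datetime(2020, 1, 27, 3, 30), datetime(2020, 1, 27, 4, 45)),    # 75 minutes
]


def mean_time_to_resolve(records):
    """Average time between detection and resolution across incidents."""
    durations = [resolved - detected for detected, resolved in records]
    return sum(durations, timedelta()) / len(durations)


def downtime_cost(records, cost_per_minute=COST_PER_MINUTE_USD):
    """Estimated cost, assuming the whole incident duration counts as downtime."""
    total_minutes = sum(
        (resolved - detected).total_seconds() / 60 for detected, resolved in records
    )
    return total_minutes * cost_per_minute


mttr = mean_time_to_resolve(incidents)
cost = downtime_cost(incidents)
print(f"MTTR: {mttr}, estimated downtime cost: ${cost:,.0f}")
```

Under these assumptions, shaving minutes off each incident lowers both figures directly, which is why the category focuses on MTTR.
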
A robust incident management platform consumes inputs from all sources, transforms those inputs into actionable incidents, routes them to the responsible party, and then empowers the response team to quickly understand and remediate the problem at hand. Moreover, this platform should also guide Post Incident Reviews following the fire-fight, making it easy for the team to create and track after-action items for continuous improvement.

### Mission

Our mission is to help DevOps teams reduce MTTR through actionable incidents, seamless integrations with communication tools, and continuous improvement supported by Post Incident Reviews and system recommendations.

### Challenges

As we invest R&D in building out Incident Management at GitLab, we are faced with the following challenges:

* The market is dominated by Incident Management companies that have been around longer. Specific examples include:
  * [ServiceNow](https://www.servicenow.com/) - founded in 2003
  * [PagerDuty](https://www.pagerduty.com/) - founded in 2009
  * [Splunk VictorOps](https://victorops.com/) - founded in 2012
  * [Atlassian Opsgenie](https://www.opsgenie.com/) - founded in 2012
* We lack brand identification with Enterprise Ops buyers (also mentioned on the [Ops Vision page](https://about.gitlab.com/direction/ops/#challenges)).
* Customers are not going to purchase GitLab for the Incident Management product alone because it depends on many other GitLab features.

### Opportunities

We are uniquely positioned to take advantage of the following opportunities:

* Colocation of code and incidents significantly reduces context switching and accelerates [MTTR](https://en.wikipedia.org/wiki/Mean_time_to_repair).
We are easily able to correlate development events such as merge requests and deploys with incidents, which shortens the time it takes to find the root cause and automates some of the work required to prepare the timeline of events needed for Post Incident Reviews.
* We are well-practiced in building [boring solutions](/handbook/values/#boring-solutions) and [iteration](/handbook/values/#iteration). This will enable us to quickly produce a simple version of Incident Management that is "just-good-enough" to displace overly complicated existing solutions, while rapidly iterating over the long term towards a lovable product in this category.
* We can dominate the incident response market for [cloud-native applications](/handbook/product/application-types/#cloud-native-web-applications), where incumbent players (like ServiceNow) have been slow to meet user requirements for an integrated, robust understanding of the health of a complex microservices-based application.
* We can uniquely serve the needs of Operations Managers who struggle to answer the question - "Are my teams spending all their time firefighting, or are they proactively managing the health of their applications?"
* We can repurpose many existing features within GitLab when we design workflows for Incident Management. This will enable us to achieve:
  * Accelerated time to market
  * Quick iterations
  * Faster feature adoption, because we are building on known workflows and concepts
  * Improvements to existing features that benefit a wider set of use cases beyond Incident Management

### High-level Design

#### Incidents in GitLab

We are leveraging GitLab's existing Issue features as a base for Incident Management. In its simplest form, an **Incident** should be the single source of truth (SSOT) for understanding:

* The current state of the incident
* The communication channels where an incident is being worked (Zoom, Slack, etc.)
* Relevant environment changes such as commits, merge requests, code, releases
* Monitoring artifacts such as alerts, errors, metrics, traces, logs
* Annotations such as runbooks and chart visualizations

#### Leveraging Existing Features

**Incidents** will be based on GitLab issues, as mentioned above. This allows us to take advantage of the following features, accelerating how quickly we can get software into the hands of customers for feedback:

* [Issue boards](https://docs.gitlab.com/ee/user/project/issue_board.html) can be used to triage and organize incidents.
* [GitLab Flavored Markdown (GFM)](https://docs.gitlab.com/ee/user/markdown.html#gitlab-flavored-markdown-gfm) and [issue templates](https://docs.gitlab.com/ee/user/project/description_templates.html#description-templates) allow users to open customized incidents and automatically assign them to the right team or label them to appear in the correct triage list.
* [ChatOps](https://docs.gitlab.com/ee/ci/chatops/README.html#gitlab-chatops) sends issue events to Slack using the [Slack notifications service](https://docs.gitlab.com/ee/user/project/integrations/slack.html#slack-notifications-service), and users can make changes to issues from Slack using [slash commands](https://docs.gitlab.com/ee/user/project/integrations/slack_slash_commands.html).

Although we are taking advantage of existing features to launch Incident Management, we are still investing in new functionality. Read on to find out what we have planned for the future and what is up next.

## Target Audience and Experience

Our current Incident Management tools have been built for users who align with our [Allison (Application Ops)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#allison-application-ops) and [Devon (DevOps Engineer)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#devon-devops-engineer) personas.
The experience targets DevOps teams at smaller companies, where it is common for engineers to be on-call and responding to alerts for the software they also write. As we mature this category, we will evolve the experience to appeal to and serve the enterprise customer.

## Strategy

### Maturity Plan

We are currently working on maturing **Incident Management** from `viable` to `complete`. Definitions of these maturity levels can be found on [GitLab's Maturity page](https://about.gitlab.com/direction/maturity/).

The following epics group the functionality we have planned to mature Incident Management.

* [Complete](https://gitlab.com/groups/gitlab-org/-/epics/1539)
* [Lovable](https://gitlab.com/groups/gitlab-org/-/epics/1494)

### What is Next & Why?

Collaboration with teammates and actionable incidents accelerate the fire-fight by enabling efficient knowledge sharing, providing guidelines for resolution, and minimizing the number of tools you need to check before finding the problem. In support of this, we are considering the following functionality to move Incident Management to `complete`:

* [Customizing incidents with alert attributes in issue templates](https://gitlab.com/gitlab-org/gitlab/issues/10744) so that individual response teams can make incidents relevant to the systems and applications they manage.
* Improving how teams [triage and organize incidents](https://gitlab.com/groups/gitlab-org/-/epics/1435) so that every problem in a multi-problem outage is visible and being addressed.
* [Integrating with widely used paging, workflow, and ticketing tools](https://gitlab.com/groups/gitlab-org/-/epics/1438) to eliminate the manual work required to update multiple systems.
* [Integrating with Zoom](https://gitlab.com/groups/gitlab-org/-/epics/1439) so that responders can start conference bridges from within GitLab when needed.
* [Linking incident response runbooks](https://gitlab.com/groups/gitlab-org/-/epics/1436) to incidents to help the on-call responder reduce MTTR.
* A [Post Incident Review experience](https://gitlab.com/groups/gitlab-org/-/epics/1782) that empowers DevOps teams to continuously improve behavior and systems.

### What is not planned right now

These features are currently out of scope for Incident Management and are not planned for any maturity level at this time. This does not exclude them from future consideration.

* Paging
* On-call schedules
* Escalation
* Remediation

## Competitive Landscape

* [Atlassian Opsgenie](https://www.opsgenie.com/)
* [Splunk VictorOps](https://victorops.com/)
* [PagerDuty](https://www.pagerduty.com/)
* [ServiceNow](https://www.servicenow.com/products/incident-management.html)
* [xMatters](https://www.xmatters.com/use-cases/major-incident-management-mim/)

## Analyst Landscape

Not yet, but accepting merge requests to this document.

## Top Customer Success/Sales Issue(s)

Not yet, but accepting merge requests to this document.

## Top Customer Issue(s)

Not yet, but accepting merge requests to this document.

## Top Internal Customer Issue(s)

Not yet, but accepting merge requests to this document.

## Top Vision Item(s)

Not yet, but accepting merge requests to this document.