
Imagine one of your apps stops working, leaving users frustrated and unable to do their work. Or say you run online events and your video platform goes down right before a live webinar.
These are examples of the types of events that require an emergency response, quick resolution, and a solid incident management process.
What is incident management in ITSM?
According to industry expert Nick Nader, solutions engineer at Isos Technology, incident management is a vital process that IT operations and DevOps teams use to tackle unexpected events that can disrupt service quality and operations (like the ones mentioned above). Incident management enables you to promptly identify and resolve issues, while preserving normal service and minimizing business impact.
In Nader’s view, “Incident management is a facet, a part of IT service management (ITSM). When we talk about IT service management, it’s all-inclusive for managing interactions with your customers. Service requests, incidents, problems, changes — all of those things are under the ITSM umbrella. ‘Incidents’ is one of those major parts of ITSM that is super important because you have to be able to resolve issues that your clients and end users are having with your services, or your systems, or your products.”
“You can define an incident as a disruption of services. Either a degradation of service, or a complete outage of service where you can’t get to a website or you can’t use a product that is hosted somewhere. It’s basically a loss of service or a disruption.”
Depending on the level of disruption, different incidents require different courses of action. According to Nader, “You have your standard incidents, normal incidents, and then emergency incidents. Your standard ones are normal incidents that maybe haven’t occurred before but aren’t emergencies, not total system down or anything like that.”
“Your normal ones are ones that are repetitive incidents, where they just happen over and over again, so you already have a structured process around them. Then, of course, emergency ones are ones that are total system outages, total disruption service that you have to put out very quickly because that’s a fire that is affecting a huge portion of your business.”
Why is incident management important?
An effective incident management process enables your customers to open incidents easily. As Nader explains, “a help desk portal with the right request form, with the right fields, makes it easy for them to select the product that they’re having an issue with — making it very easy for them to attach error screenshots and things like that. Just sourcing the incidents from your end users is super helpful.”
What are the steps in the incident management workflow?
To ensure a systematic approach to managing incidents, minimizing disruption, and bringing services back, you’ll need to create a workflow that includes a few simple steps.
An incident management workflow typically includes the following steps, illustrated here with a network connectivity outage:
- Incident identification: Employees or clients report a network connectivity issue to IT via phone, email, or an incident reporting tool.
- Incident logging: The IT staff logs each reported incident, capturing details like the user's name, contact information, time of the incident, and a brief description of the problem.
- Incident categorization and prioritization: IT categorizes incidents based on the reported issue, such as "network connectivity" or "internet access." They prioritize incidents based on the number of affected users and the importance of their work.
- Initial diagnosis and investigation: The IT support team starts investigating the issue. They check network equipment, review system logs, and analyze recent changes to identify the root cause.
- Incident escalation: If the IT support team cannot resolve the issue within a specified time or lacks the necessary expertise, they escalate the incident to the network team or the network service provider for further assistance.
- Incident resolution and workaround: The network team identifies what caused the outage. They replace the faulty hardware (or fix whatever the issue is) and restore connectivity for the affected users. During the resolution process, they also implement a temporary workaround by redirecting network traffic through an alternate path to make sure users can still connect.
- Incident closure and documentation: Once IT has restored network connectivity, the team updates incident records, documenting the actions taken, the root cause, and the steps taken to resolve the issue. They formally close incidents in the incident management system.
- Incident communication: The IT help desk stays in contact with affected employees or clients throughout the incident. Their goal is to provide updates on the situation, let people know when the problem will be fixed, and inform them of any temporary workarounds.
- Incident review and analysis: Once the issue is resolved, the IT team conducts a post-incident review. They analyze the root cause and identify steps to prevent similar outages in the future.
These steps demonstrate how incident management in ITSM addresses and resolves an incident that disrupts the normal functioning of an organization's infrastructure. By effectively managing incidents, organizations can minimize the impact on productivity, restore services promptly, and prevent similar incidents from occurring in the future.
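To make the workflow above more concrete, here is a minimal Python sketch that models an incident as a small state machine. The status names, the allowed transitions, and the `Incident` class are all illustrative assumptions, not part of any particular ITSM tool. Communication is left out as a status because it runs alongside every step rather than after one of them.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    """Hypothetical statuses mirroring the workflow steps listed above."""
    IDENTIFIED = "identified"
    LOGGED = "logged"
    CATEGORIZED = "categorized"
    INVESTIGATING = "investigating"
    ESCALATED = "escalated"
    RESOLVED = "resolved"
    CLOSED = "closed"
    REVIEWED = "reviewed"


# Allowed transitions (an assumption; adapt these to your own process).
TRANSITIONS = {
    Status.IDENTIFIED: {Status.LOGGED},
    Status.LOGGED: {Status.CATEGORIZED},
    Status.CATEGORIZED: {Status.INVESTIGATING},
    Status.INVESTIGATING: {Status.ESCALATED, Status.RESOLVED},
    Status.ESCALATED: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED},
    Status.CLOSED: {Status.REVIEWED},
    Status.REVIEWED: set(),
}


@dataclass
class Incident:
    summary: str
    status: Status = Status.IDENTIFIED
    history: list = field(default_factory=list)

    def advance(self, new_status: Status, note: str = "") -> None:
        """Move to new_status if the workflow allows it, and record the step."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value} to {new_status.value}"
            )
        self.history.append((self.status, new_status, note))
        self.status = new_status


# Walk the network outage example through the first few steps.
incident = Incident("Office network connectivity lost")
incident.advance(Status.LOGGED, "reported by phone")
incident.advance(Status.CATEGORIZED, "network connectivity, many users affected")
incident.advance(Status.INVESTIGATING, "checking equipment and system logs")
incident.advance(Status.ESCALATED, "handed to the network team")
print(incident.status.value)
```

Rejecting invalid transitions (for example, closing an incident that was never resolved) is one simple way to keep incident records consistent, and the recorded history gives the post-incident review something to work from.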
What are some incident management best practices?
There are some best practices you can use to achieve swift and efficient incident resolution. “You have to just kind of set the stage for success and then maintain it as you go, and there’s never too many safety nets,” as Nader observes.
Keeping that in mind, here are 10 incident management tips to help you minimize downtime:
- Clearly define incident priorities based on impact and urgency. Establish a well-defined system that ensures incidents receive appropriate attention and resources, preventing any confusion or delays.
- Implement a centralized incident management system that becomes the heart of your operations. A single source of truth streamlines communication, enhances collaboration, and keeps everyone on the same page.
- Establish consistent and timely communication channels during incidents. Keep affected users and stakeholders informed about progress, expected resolutions, and any available workarounds.
- Assemble a dedicated team of incident management champions. Equip them with the right skills and knowledge to handle incidents with precision and grace. This team will be your guiding light in times of trouble.
- Create effective escalation procedures, so that incidents requiring higher-level support are escalated promptly. No delays, no bottlenecks — just seamless handoffs to the right people.
- Foster a culture of continuous improvement and learning. Conduct thorough post-incident reviews to uncover root causes, identify process gaps, and implement improvements.
- Provide comprehensive training to your incident management team. Equip them with the know-how to handle incidents swiftly and effectively. Maintain up-to-date documentation and knowledge bases for easy access to information.
- Encourage collaboration and knowledge sharing among team members. Break down silos and foster cross-functional communication. Set up regular meetings to share best practices so your team can benefit from everyone’s collective wisdom.
- Use incident data and metrics to make data-driven decisions for preventive maintenance or system enhancements. Monitor trends and identify recurring issues, so you can act accordingly to prevent them from happening again.
- Conduct regular incident management drills to keep your skills sharp. Simulate scenarios to test your processes, identify weaknesses, and provide hands-on training.
Incidents give your team an opportunity to improve their response every time. If you keep that in mind while also following best practices, you will be on your way to creating a well-oiled incident response machine.
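The first tip above, defining priorities from impact and urgency, is often implemented as a simple lookup. The sketch below is one possible version in Python. The three levels, the scoring rule, and the `P1` to `P4` labels are assumptions chosen for illustration, so treat them as a starting point rather than a standard.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> str:
    """Combine impact and urgency into a priority label (P1 is most severe)."""
    score = impact + urgency  # ranges from 2 (low/low) to 6 (high/high)
    if score >= 6:
        return "P1"
    if score == 5:
        return "P2"
    if score == 4:
        return "P3"
    return "P4"


# A total outage affecting most users, needing an immediate fix.
print(priority(Level.HIGH, Level.HIGH))
# A minor glitch that affects one user and can wait.
print(priority(Level.LOW, Level.LOW))
```

Writing the rule down in one place, whether in code or in a shared table, helps everyone assign the same priority to the same kind of incident.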
Which incident management tools should you consider?
Nader recommends Atlassian’s Opsgenie because it “integrates with other monitoring tools so that you can source incidents from your systems themselves and get to them before your end users experience that outage or whatever the incident may be.”
Two other options are Appfire’s Slack Integration+ for Jira, which integrates Jira Software with Slack, and Microsoft Teams Integration+ for Jira, which integrates Jira Software with Microsoft Teams. When an incident occurs, teams can spin up a quick “war room” from Jira Software, and the appropriate team members will be pulled into a dedicated Slack or Microsoft Teams channel where they can swarm on the incident and resolve it.
Time to SLA is another powerful app that enables teams to create service level agreements (SLAs) for how quickly incidents should be resolved within Jira Service Management and Jira Software.
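As a rough illustration of what an SLA on resolution time means in practice, here is a small Python sketch that computes a resolution deadline and checks whether it has been missed. It does not use or represent the Time to SLA app or any Jira API. The target durations, the priority labels, and the function names are hypothetical, and the calculation uses plain wall-clock time rather than business-hours calendars.

```python
from datetime import datetime, timedelta

# Hypothetical resolution targets per priority; real targets come from
# your own agreements with customers.
SLA_TARGETS = {
    "P1": timedelta(hours=4),
    "P2": timedelta(hours=8),
    "P3": timedelta(days=2),
    "P4": timedelta(days=5),
}


def sla_deadline(opened_at: datetime, priority: str) -> datetime:
    """Return the time by which the incident should be resolved."""
    return opened_at + SLA_TARGETS[priority]


def is_breached(opened_at: datetime, priority: str, now: datetime) -> bool:
    """Return True if the incident is still open past its deadline at `now`."""
    return now > sla_deadline(opened_at, priority)


opened = datetime(2024, 1, 15, 9, 0)
print(sla_deadline(opened, "P1"))
print(is_breached(opened, "P1", now=datetime(2024, 1, 15, 12, 30)))
print(is_breached(opened, "P1", now=datetime(2024, 1, 15, 14, 0)))
```

Passing `now` in explicitly, instead of reading the clock inside the function, makes the check easy to test and to rerun against historical incidents.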
Teams can also create response templates with Canned Responses Pro Templates for Jira to save time spent manually typing the same response repeatedly within Jira. If an incident occurs, for example, then multiple people across an organization are likely to report it to their IT, Support, or Dev team. These teams can create a template in Canned Responses so support agents can quickly reply to all of these tickets with correct, consistent information.
With the right tools, you can customize your incident management processes to suit your culture and workflows.
When clients and end users have trouble accessing your services, systems, or products, it’s mission-critical to get the problem resolved as soon as possible. You also want to stay in touch with users who are potentially growing more frustrated by the minute, so they know what to expect (and feel reassured that you care). You can achieve all this through incident management.
Incident management enables IT operations and DevOps teams to jointly tackle unexpected events that can disrupt service quality and operations in the most streamlined and efficient way possible. By following a few carefully crafted steps, these teams can manage incidents systematically and promptly, minimize disruption, restore services, and reduce anxiety on your team. That’s a win/win.
Learn more
To view a video interview on this topic with Nick Nader, please click here.
