"What is incident management in ITSM, why is it important, and how can I do it right?" Find out in this episode of Appfire Presents: The Best IT Service Management Show by Appfire.
Nick Nader of Isos Technology joins Kerry O’Shea Gorgone to explain incident management in ITSM: what it is, why it matters, and how to do it. We cover the definition of incident in the IT context, then dive into how you can set your organization up for success through planning, technology, and communication.
To get deeper into the incident management topic, check out this article from Isos Technology: "How Jira Service Management Streamlines Service Operations: Incident Management."
About the guest
Nick Nader is a Solutions Engineer at Isos Technology. With a passion for innovative technology and a background in software test automation, quality assurance, and the CI/CD lifecycle, Nick designs custom solutions to meet organizations' business needs and help them get the most out of their Atlassian tools.
He holds Atlassian certifications in Jira Service Management and Agile with Jira, and is a Certified SAFe Program Consultant. When he's not behind a computer screen, he enjoys outdoor activities like hiking with his wife and dog, snowboarding, and playing pick-up sports.
About the show
The BEST ITSM Show by Appfire brings you expert insights for IT service delivery, so your employees and customers have what they need to succeed. Get the right tech and tips for the right job at hand. Look like you’ve come from the future with all your new ITSM smarts. Every episode is a brisk 10 minutes—less time than it takes to provision a laptop or troubleshoot a tech support issue.
For your convenience, here is the transcript of this episode:
What is incident management in ITSM, why is it important, and how can I do it right?
Kerry: Today we’re going to answer the question: what is incident management in ITSM, why is it important, and how can I do it right? To help us answer that question is Nick Nader, a solutions engineer at Isos Technology. He holds Atlassian certifications in Jira Service Management and Agile with Jira and is a certified SAFe program consultant. Stick around for 10 minutes of ITSM awesome.
Thanks so much for being here, Nick. What is incident management, or what’s an incident, actually?
Nick: That’s a great question. Incident management is a facet, a part of ITSM. When we talk about IT service management, it’s all inclusive for managing interactions with your customers. Service requests, incidents, problems, changes, all of those things are kind of under the ITSM umbrella. Incidents is one of those major parts of ITSM that is super important because you have to be able to resolve issues that your clients and end users are having with your services, or your systems, or your products.
Kerry: Are these things that are outside of the day to day, things that disrupt your day to day operations?
Nick: Exactly, yes. You can define incident as a disruption of services. Either a degradation of service or a complete outage of services where you can’t get to a website or you can’t use a product that is hosted somewhere. It’s basically loss of service or disruption.
Kerry: I’m guessing you want to have some idea of what you’re going to do before it happens. You don’t want to just make it up as you go.
Nick: Exactly. You have to have a plan. You essentially have to have an incident resolution process, and your help desk agents should be churched up on that process so they don’t just scramble with their heads cut off when an incident comes in from the help desk. That’s a big part of it, documenting your process, having your agents churched up on that documentation of the process, and just making it readily available within the tools so they can reference that process and take the right steps to resolve that incident, and loop in the teams, the stakeholders, and the necessary people that it’s going to take to resolve that incident.
Kerry: It’s not like it’s the first time it’s ever happened just because it’s the first time they’re seeing it as an individual.
Nick: Very true. There’s a couple different classifications of incidents. You have your standard incidents, normal incidents, and then emergency incidents. Your standard ones are normal incidents that maybe haven’t occurred before but aren’t emergencies, not total system down or anything like that.
Your normal ones are ones that are repetitive incidents, like you said, where they just happen over and over again, so you already have a structured process around them. Then, of course, emergency ones are ones that are total system outages, total disruption of service that you have to put out very quickly because that’s a fire that is affecting a huge portion of your business.
Kerry: If you have a really sophisticated tech stack, I would think that the stakes are really high because any one of the pieces in that could go out and affect all the rest of them, so that would maybe be an emergency, like now we can’t do anything.
Nick: Right. Absolutely. When you look at Jira Service Management, they always have default fields like affected service, and that’s usually what your end users are opening up an incident against is this is the service or product that I’m having an issue with. That on the back end can actually be tied to all of the other services that it has dependencies with. If I’m having an issue with AWS, and we’re hosting a bunch of stuff on AWS, all of those other services could be down because we’re having an incident with AWS.
You have to have that dependency map between your services so that when you have incidents that are affecting one and could be affecting several because of that, you know where those fires are, which services are down, and how they relate to each other.
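The dependency map Nick describes can be pictured as a small graph lookup. The following is a minimal sketch, not how Jira Service Management stores service relationships; the service names and the `affected_services` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical service names; each service lists the services it depends on.
DEPENDS_ON = {
    "customer-portal": ["auth-service", "aws"],
    "auth-service": ["aws"],
    "billing": ["auth-service"],
    "status-page": [],
}


def affected_services(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each service, which services rely on it?
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    # Breadth-first walk outward from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in affected:
                affected.add(service)
                queue.append(service)

    affected.discard(failed)  # only report the knock-on services
    return sorted(affected)


print(affected_services("aws", DEPENDS_ON))
```

With this sample map, an incident opened against `aws` also flags `auth-service`, `billing`, and `customer-portal`, while `status-page` is left out because nothing ties it to the failed service.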
Kerry: That’s the stick. What’s the carrot? What goes better when you have good incident management?
Nick: Great question. A proper incident management process is going to enable your customers to open incidents very easily. A help desk portal with the right request form, with the right fields, making it easy for them to select the product that they’re having an issue with, making it very easy for them to attach error screenshots and things like that. Just sourcing the incidents from your end users is super helpful.
Atlassian also has a product called Opsgenie that integrates with a bunch of monitoring tools so that you can actually source incidents from your systems themselves and get to them before your end users experience that outage or whatever the incident may be.
Kerry: So, when nobody has complained yet?
Nick: Yes. Exactly. You can cut it off before it gets to that customer. Your monitoring systems are basically built for that. What Opsgenie does is it integrates with all of those monitoring systems, takes the data payload from all of those monitoring systems, and then parses it and cleans out the gunk that you don’t care about, and then routes the alerts and data to the proper people.
You can have a team per product in Opsgenie, and those teams are the ones that are going to get alerts when the monitoring systems for their product start pinging them, start sending SMS messages. You can route it and escalate it to those different teams. They have on-call schedules and things like that. You can get them in text form or phone calls if it’s an emergency.
There’s a lot of different methods of communicating out those alerts coming from your monitoring systems through that Opsgenie product. Super helpful if you don’t want your customers to find it first.
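The parse, clean, and route flow Nick describes can be sketched in a few lines. This is an illustrative sketch only and does not use the Opsgenie API; the payload fields, team names, and the `route_alert` function are assumptions made up for the example.

```python
# Hypothetical routing table: one team per product.
TEAM_FOR_PRODUCT = {
    "checkout": "payments-team",
    "search": "search-team",
}


def route_alert(payload):
    """Reduce a raw monitoring payload to a routed alert, or None if no team owns it."""
    team = TEAM_FOR_PRODUCT.get(payload.get("product"))
    if team is None:
        return None  # no owning team configured for this product

    # Emergencies go out as phone calls; everything else as a text message.
    severity = payload.get("severity", "low")
    channel = "phone" if severity == "critical" else "sms"

    # Keep only the fields responders care about; drop the rest of the payload.
    return {
        "team": team,
        "channel": channel,
        "message": payload.get("message", ""),
    }


raw = {
    "product": "checkout",
    "severity": "critical",
    "message": "Error rate above threshold",
    "host": "web-17",        # extra detail the responder does not need
    "agent_version": "4.2",  # extra detail the responder does not need
}
print(route_alert(raw))
```

The routed alert carries only the owning team, the delivery channel, and the message, which mirrors the idea of stripping out the noise before anyone is paged.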
Kerry: How often do you go in and refresh settings on whatever kind of app you have, making sure that your instance doesn’t die? How often should you check?
Nick: That’s a great question. Maintenance of the product that’s monitoring the systems, for sure, it’s something that you should probably have a dedicated admin for. Opsgenie typically is going to have an administrator that will do routine maintenance, make sure the integrations are working, but the system itself also has error logs and audit logs.
It will tell you the connection isn’t there anymore. It can have heartbeat monitors, so if I don’t get a ping from our system every five minutes or one minute, then we’re going to have an issue and we’re going to send off an alert, so it can even alert you about those kinds of things.
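The heartbeat idea reduces to a simple time comparison: if the last ping is older than the expected interval, raise an alert. A minimal sketch, assuming a five-minute interval as in Nick's example; the `heartbeat_expired` function is hypothetical and is not an Opsgenie call.

```python
from datetime import datetime, timedelta


def heartbeat_expired(last_ping, now, interval=timedelta(minutes=5)):
    """Return True when no ping has arrived within the expected interval."""
    return now - last_ping > interval


now = datetime(2024, 1, 1, 12, 0, 0)

# Last ping three minutes ago: still healthy.
print(heartbeat_expired(now - timedelta(minutes=3), now))

# Last ping seven minutes ago: the monitored system has gone quiet, so alert.
print(heartbeat_expired(now - timedelta(minutes=7), now))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test against fixed timestamps.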
Kerry: You mentioned having a full time person dedicated to updating that stuff. That person is going to need to be informed about team changes, too. Or do you recommend tying alerts to roles as opposed to individual people, because people do move around?
Nick: For sure, they do. The alerts are usually routed to a team as a whole. We’ll have a team for a service or a product, and that team’s members could change. Your team members could shift between teams, they could be part of multiple teams, but as long as they’re built into that team configuration and have a section or a rotation in the on-call schedule, then they’re going to be getting the alerts from whichever teams they’re a part of.
I think that you definitely need a dedicated administrator, but that administrator could be administrating multiple products. Not to say that you need a full time engineer just for Opsgenie. You definitely don’t need a full time engineer just for Opsgenie. Once you get it set up, the maintenance is relatively lightweight, so it’s usually an administrator that’s doing several of your systems that is handling the Opsgenie side as well.
Kerry: What’s the worst situation you’ve ever seen and how did you fix it? Have you ever seen a big mess and gone, “If something actually happens, we’re dead.”
Nick: That’s a great question. We’ve had clients that have run into those issues. Luckily, I’m not a full time incident management development or operations manager, so I haven’t had to put out any fires personally. But our clients put them out all the time, and we’ve created systems that help them handle those major fires.
Clients that have very large websites. Clients that host and run streaming platforms like the one that we’re on right now. Ones that are constantly live all of the time, they’ve had major issues where their services go completely down. Maybe something like Cloudflare or another DNS provider goes down, and 20% of the internet goes down. That’s a huge issue for their services and systems, but as long as they have that incident resolution process in place, those monitoring tools in place, they’re going to be able to hop on that extremely quickly, it’s going to wake up managers and stakeholders if they’re in a bad time zone or it happens overnight.
So, people are going to be able to jump on that fire very quickly. The tools kind of facilitate that quick communication and collaboration with things like hosted bridges for specific major incidents, so you can spin up a video conference bridge very easily, you can spin up a chat room and channel dedicated to an incident very easily with just one click of a button. The communication is all there and status updates are there during resolution so that stakeholders don’t have to constantly be on that video conference, but they can get status updates from the people that are doing the resolution of that incident and resolving it.
The other piece is tying in the development and the operations teams. It kind of ties through Opsgenie, allowing you to tie in those developers, their source code repositories, so the people that are actually going to make the changes that fix that incident are also tied into this process even though they don’t live in your help desk or live in Opsgenie. They maybe live over in Jira Software or other development tools, and they can tie into the incident management process because the tool facilitates that integration so that they’re also aware of it and they can quickly dig into a code commit or a deployment that they pushed recently and see was this the root cause of our incident and can we resolve the root cause.
That’s starting to get into problem management, but incident, problem, and change management are all related. The incident happens first, you try to dig into the root cause of what caused the incident, and that’s problem management. Then you perform a change to fix that root cause, which is change management. It’s all kind of tied together. Incident management is just the beginning of that chain.
Kerry: I feel like I would be remiss if I didn’t point out that Appfire has Configuration Manager for Jira which you could set up to avoid issues. Also, Guardrails is a new feature in Jira that can help you set things up, set parameters so that if your instance is going to die, or if you go over a certain limit, you can have it alert you before that happens.
Nick: Right.
Kerry: You really have to just kind of set the stage for success and then maintain is what it sounds like.
Nick: Exactly.
Kerry: And it’s never too late.
Nick: Yes. And there’s never too many safety nets. Having Guardrails up is fantastic because it’s just another safety net for your systems that are business critical, basically.
Kerry: If you need more information about this topic (and who doesn’t) go to Blog.IsosTech.com/incidentmanagement, and you’ll find a lot of information there. Nick, thank you so much for talking with me about this today. I learned a lot. I hope not to cause incidents at work.
This has been The Best ITSM Show by Appfire. You can find more episodes at Hub.Appfire.com. Thanks. We’ll see you next time.