Why Problem Management is Important to Security

I want to share a perspective I have gained from my extensive experience with information and physical security, combined with my more recent experience with the Information Technology Infrastructure Library (ITIL), and more specifically with problem management. ITIL defines problem management as “The process responsible for managing the lifecycle of all problems. Problem management proactively prevents incidents from happening and minimizes the impact of incidents that cannot be prevented” (Steinberg, Rudd, Lacy, and Hanna, 2011). How, then, is a problem defined? ITIL tells us that a problem is “a cause of one or more incidents. The cause is not usually known at the time a problem record is created, and the problem management process is responsible for further investigation” (Steinberg et al., 2011).

As I learned more about ITIL, these definitions made me cringe. Why do they keep saying “incident”? To an ITIL-based Information Technology (IT) shop, incidents (little i) happen all the time. Network outages, unavailable file servers, websites going down and the like are examples of routine IT incidents. This is in stark contrast to what my security background tells me is an Incident (big I). The United States Computer Emergency Readiness Team (US-CERT) defines a security Incident as “the act of violating an explicit or implied security policy…”

ITIL is a framework for IT. Problem management is a component of ITIL. So why is problem management important to information security? As my anxieties calmed and I learned to control my reaction to the word “incident,” I realized the value of problem management. ITIL tells us that problem management improves service availability and overall quality within IT. What I learned is that by applying effective problem management methodologies to identify the TRUE root cause of an incident (and now you choose the case of the I), security teams spend less time in reactive firefighter mode and more time in a mature, responsive mode in which they can determine impacts. Knowing the root cause enables us to better understand, and effectively communicate, the impact to the business.

Understanding the root cause of an incident is much more than simply knowing the “thing” that caused it. In my experience, a single technical cause is often identified. However, we should have defense-in-depth strategies that prevent a single failure from causing an impact. I have seen teams declare a root cause to be a technical finding, such as a power supply failure, a drive failure, a patch that didn’t go as planned, etc. My recent experience prompts the question: is that really the root cause? Although it may be the technical snowflake that caused the avalanche, in my opinion it is often not the root cause. Why did one of those “things” cause an outage? In addition to technical causes, we can consider administrative root causes. Using the examples above, administrative causes may include a lack of executive support for investing in redundancy, a lack of preventative maintenance programs, or a failure to respond to an automated alert advising that the redundant disk had previously failed.

Have you ever found yourself in a situation where you were told “they found the root cause”? Think again of the example above. The hard drive failed, so that is the root cause, right? It clearly failed. You can see the evidence. Mechanical things break. Makes good sense…right? What if you had a process that was able to provide more information? Leading up to the drive failure, outbound network traffic increased 250%. Database utilization increased 400%. Disk input/output on that disk increased 1000%. The root cause, in this fictional example, may have just shifted from “bad mechanical disk” (no big deal) to “multi-system compromise, data exfiltration” and “we are now glad the disk failure stopped the leakage.”
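To make the fictional example a little more concrete, here is a minimal sketch of the kind of check a problem management process might run before accepting “the disk failed” as the root cause: compare each metric’s level just before the failure against its normal baseline and flag the large jumps. Everything in it is an assumption for illustration: the `MetricWindow` type, the metric names, the sample values, and the 200% threshold are all hypothetical, and a real process would draw these numbers from whatever monitoring data is actually available.

```python
from dataclasses import dataclass


@dataclass
class MetricWindow:
    """Hypothetical record: one metric's normal level and its level before the failure."""
    name: str
    baseline: float
    before_failure: float


def percent_increase(baseline: float, observed: float) -> float:
    """Return the increase of observed over baseline, as a percentage of baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (observed - baseline) / baseline * 100.0


def suspicious_metrics(windows, threshold_pct=200.0):
    """Return the names of metrics whose increase meets or exceeds the threshold.

    The 200% default is an arbitrary illustrative value, not a recommendation.
    """
    return [
        w.name
        for w in windows
        if percent_increase(w.baseline, w.before_failure) >= threshold_pct
    ]


# Made-up sample values chosen to mirror the increases in the fictional example.
windows = [
    MetricWindow("outbound_network_traffic", 40.0, 140.0),  # +250%
    MetricWindow("database_utilization", 10.0, 50.0),       # +400%
    MetricWindow("disk_io", 500.0, 5500.0),                 # +1000%
    MetricWindow("cpu_temperature", 55.0, 58.0),            # small, ordinary change
]

flagged = suspicious_metrics(windows)
print(flagged)
```

With these sample values, the three metrics from the example are flagged and the ordinary one is not. A flagged metric does not prove a compromise; it is simply a reason to keep the problem record open and keep asking why.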

In security, we are often challenged to consider whether a service outage was due to an unknown malicious actor, malware, a yet-to-be-found compromise or other “bad things.” As we become aware of recurring incidents that are either the same or strangely similar, that concern often grows. NTT Security has adopted a full security life cycle approach that looks at both overall security program maturity and the maturity of individual security controls, identifying foundational issues that may be the true root cause of security incidents.

An effective problem management process that spans the full security life cycle can pay dividends for the overall security management program and give the business confidence that a “true” root cause has been found, not merely enough of one to close the incident ticket.

For assistance with developing a problem management process, reach out to NTT Security to discuss your problem management needs at us-info@nttsecurity.com or visit our global website at: https://www.nttsecurity.com.