Keep Calm and Respond - Practical Guide to Cyber Security Incident Response
This blog post gives an overview of the general concepts of Cyber Security Incident Management: its definition, how it fits into the domains of cyber security, and the vital steps that should be followed when handling a cyber security incident.
For the sub steps of cyber incident handling, it highlights some typical pitfalls to avoid and explains how incident responders can remain calm and keep cyber security incident management efficient.
NIST developed a framework called the “Cybersecurity Framework”, which can be used to understand the bigger context of cyber security incident management. According to this framework, organizations need to build five major capabilities with regard to cyber security:
- Identify capability to know what should be protected and which level of protection is required in alignment with the relevant risks
- Protect capability to select and implement the necessary level of protection for the assets identified as valuable for the organization
- Detect capability to recognize if there is any cyber security incident impacting the protected assets
- Respond capability to handle the detected cyber security incidents adequately
- Recover capability to restore business as usual operations after any kind of cyber security incident
Following this capability-based breakdown of cyber security, we can say that cyber security incident management focuses on establishing the detection and response capabilities.
Therefore, the main scope and mission of incident management is to detect the cyber security incidents and provide response steps to mitigate their impact on the business.
Figure 1 - Scope of Cyber Security Incident Management
The cyber security incident handling capability needs to be established over the people, process, and technology layers. The technology layer enables the skilled staff to detect and respond to cyber security incidents via the pre-defined incident management process.
This blog post introduces the core detection and response processes in detail and highlights some technology elements that are enablers or can be leveraged to increase the efficiency of cyber security incident response.
Figure 2 - Layers of Detection & Response
The handling of a cyber security incident has a well-defined choreography. NIST introduced a lifecycle model for cyber security incidents in the Computer Security Incident Handling Guide (published as NIST Special Publication 800-61).
According to that publication, the main stages of the lifecycle are:
- Preparation
- Detection & Analysis
- Containment, Eradication & Recovery
- Post-Incident Activity
Figure 3 - Incident Lifecycle
Using this lifecycle model, we can conclude that the very same steps have to be taken for any cyber security incident to drive it through its lifetime.
This article zooms into the Detection & Analysis and the Containment, Eradication & Recovery stages. Their sub steps are usually referred to as the Incident Detection and Response Chain, which expresses that for each and every cyber security incident the Continuous Monitoring, Detection, Triage, Analysis, Containment, Eradication & Remediation, Reporting, and Communication & Coordination tasks should be completed.
Figure 4 - Incident Detection and Response Chain
Let us review the sub steps in a bit more detail.
To detect any harmful activity, continuous monitoring is required. Its goal is real-time visibility of the assets that should be protected. Such visibility can be gained by collecting cyber security relevant system logs, endpoint information and network packets. Once the required data is collected, it must be aggregated, parsed, indexed and analyzed to identify malicious behaviors or patterns. On the technology layer, this is usually handled by a SIEM tool.
The first and quite common pitfall arises already at this step. We might be tempted to think that collecting more data gives more visibility without any drawback, so let us collect the whole universe. Unfortunately, it is really easy to overestimate both the capabilities of the technology that has to process the collected data and our human capability to stay focused among tons of alerts and indicators of potentially malicious activity. So, we need to remain calm and focus on what is relevant. A risk-driven and business-aligned approach obviously helps with this.
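As a minimal sketch of the parse-and-analyze part of such a pipeline, the following Python snippet parses a few log lines and counts failed logins per source address. The log format, the field names and the threshold of 3 are assumptions made for illustration only; a real SIEM tool does this at scale with its own parsers and correlation rules.

```python
import re
from collections import Counter

# Hypothetical log format, assumed for illustration only:
# "<timestamp> <host> sshd: Failed password for <user> from <ip>"
FAILED_LOGIN = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) sshd: "
    r"Failed password for (?P<user>\S+) from (?P<ip>\S+)$"
)

def parse_line(line):
    """Return the fields of a failed-login line, or None for any other line."""
    match = FAILED_LOGIN.match(line.strip())
    return match.groupdict() if match else None

def failed_logins_by_source(lines, threshold=3):
    """Count failed logins per source IP; keep sources at or above the threshold."""
    counts = Counter()
    for line in lines:
        event = parse_line(line)
        if event is not None:
            counts[event["ip"]] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}

sample_logs = [
    "2024-01-01T10:00:00 web01 sshd: Failed password for root from 203.0.113.7",
    "2024-01-01T10:00:02 web01 sshd: Failed password for admin from 203.0.113.7",
    "2024-01-01T10:00:03 web01 sshd: Accepted password for alice from 198.51.100.4",
    "2024-01-01T10:00:05 web01 sshd: Failed password for root from 203.0.113.7",
    "2024-01-01T10:00:09 web01 sshd: Failed password for bob from 198.51.100.9",
]

print(failed_logins_by_source(sample_logs))  # {'203.0.113.7': 3}
```

Only the source that reaches the threshold is reported, which illustrates in miniature why the choice of what to collect and what to alert on matters more than the sheer volume of data.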
Once a potential incident is detected, it must be validated: is the event a real cyber security incident, or did we just run into a false positive alert? To keep incident response efficient, this validation must be done in a relatively short period of time, typically within 15 or 30 minutes.
Usually multiple factors should be considered during this validation, such as where the alert came from or whether there are any related known activities.
There are multiple pitfalls at this step too, so again, we need to remain calm. For example, treating every network intrusion detection alert as a confirmed incident would easily eat up all our resources. IDS signatures might easily trigger on binary or encrypted network traffic, so a high false positive rate should be expected for specific types of IDS alerts. The much-discussed WannaCry is another example. As that particular ransomware spread through an unpatched SMB version 1 vulnerability, many organizations started to deploy alerting mechanisms for any SMB traffic. Without proper tuning, this resulted in alert floods, but at least it showed how widely SMB version 1 is still used in enterprise networks.
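A first-pass validation of this kind could be sketched as follows. The `Alert` fields, the source labels, the verdict names and the 30 minute budget are hypothetical choices for illustration; an actual SOC would use its own criteria.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed time budget for validation; the text names 15 or 30 minutes as typical.
VALIDATION_BUDGET = timedelta(minutes=30)

@dataclass
class Alert:
    source: str                   # hypothetical labels, e.g. "siem", "ids", "helpdesk"
    raised_at: datetime
    related_known_activity: bool  # e.g. an approved vulnerability scan

def first_pass_verdict(alert, high_fidelity_sources=("siem",)):
    """Return 'false_positive', 'likely_incident' or 'needs_review'."""
    if alert.related_known_activity:
        return "false_positive"
    if alert.source in high_fidelity_sources:
        return "likely_incident"
    return "needs_review"

def validation_overdue(alert, now):
    """True if the alert has waited longer than the validation budget."""
    return now - alert.raised_at > VALIDATION_BUDGET

raised = datetime(2024, 1, 1, 10, 0)
ids_alert = Alert(source="ids", raised_at=raised, related_known_activity=False)
scan_alert = Alert(source="siem", raised_at=raised, related_known_activity=True)

print(first_pass_verdict(ids_alert))   # needs_review
print(first_pass_verdict(scan_alert))  # false_positive
print(validation_overdue(ids_alert, now=raised + timedelta(minutes=45)))  # True
```

The point of the sketch is only that the alert source and known related activity are checked explicitly and that the time spent on validation is tracked.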
To avoid such pitfalls, the detection content must be thoroughly tested and tuned to the monitored environment before it is enabled in production. This can be ensured via strict content management processes.
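One way to picture such testing is to replay labeled sample events against a rule and measure how often it fires on benign traffic. The sketch below is an assumption-laden illustration: the event fields (`dst_port`, `smb_version`), the sample data and the acceptance thresholds are invented for the example.

```python
def false_positive_rate(rule, labeled_events):
    """Share of benign events on which the rule fires."""
    benign = [event for event, is_malicious in labeled_events if not is_malicious]
    if not benign:
        return 0.0
    return sum(1 for event in benign if rule(event)) / len(benign)

def detection_rate(rule, labeled_events):
    """Share of malicious events on which the rule fires."""
    malicious = [event for event, is_malicious in labeled_events if is_malicious]
    if not malicious:
        return 0.0
    return sum(1 for event in malicious if rule(event)) / len(malicious)

def ready_for_production(rule, labeled_events, max_fpr=0.25, min_detection=1.0):
    """Hypothetical acceptance gate used before enabling a rule."""
    return (false_positive_rate(rule, labeled_events) <= max_fpr
            and detection_rate(rule, labeled_events) >= min_detection)

def any_smb_traffic(event):
    """Untuned rule: alert on every SMB connection."""
    return event["dst_port"] == 445

def smb_v1_only(event):
    """Tuned rule: alert only on SMB version 1 connections."""
    return event["dst_port"] == 445 and event["smb_version"] == 1

# (event, is_malicious) pairs; invented sample data.
labeled_events = [
    ({"dst_port": 445, "smb_version": 3}, False),
    ({"dst_port": 445, "smb_version": 3}, False),
    ({"dst_port": 445, "smb_version": 2}, False),
    ({"dst_port": 445, "smb_version": 1}, False),  # legacy but benign
    ({"dst_port": 443, "smb_version": None}, False),
    ({"dst_port": 445, "smb_version": 1}, True),
]

for rule in (any_smb_traffic, smb_v1_only):
    print(rule.__name__, false_positive_rate(rule, labeled_events),
          ready_for_production(rule, labeled_events))
```

With this sample data the untuned rule fires on four of five benign events, while the tuned rule fires on one, so only the tuned rule passes the assumed gate.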
Once it is confirmed that we detected a real cyber security incident, the next step is to categorize it and assign a priority or severity that determines the timelines for the response tasks. The goal of the triage step is remarkably similar to that of triage in a hospital trauma unit: it ensures that cases are handled with the appropriate priority.
There are multiple concepts for defining cyber security incident categories; one of them is VERIS (Vocabulary for Event Recording and Incident Sharing). It distinguishes categories such as hacking, social, malware, misuse, physical, error and environmental.
Initial categorization is usually not a complex task, especially since, if the source of detection is a SIEM tool, the alert that triggered is usually mapped to a given incident category already during the design phase.
The same applies to determining the severity. If the incident was detected based on an alert or report, it will have an initially assigned severity score that might need to be adjusted based on the outcome of the investigation. If the incident was detected from other sources (e.g. someone notified the IT Helpdesk about suspicious behavior or a malfunction) and there is no "default severity" assigned, multiple methods exist for determining the priority/severity.
NIST's Computer Security Incident Handling Guide highlights some attributes that can be used for calculating the severity, such as how the security attributes of the information asset are impacted and what the potential impact on the business is.
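A minimal sketch of such a calculation is shown below. The four-level scale, the multiplication of the two factors and the score boundaries are assumptions chosen for illustration; they are not a formula taken from the NIST guide.

```python
from enum import IntEnum

class Level(IntEnum):
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def severity(confidentiality, integrity, availability, business_impact):
    """Combine the worst security-attribute impact with the business impact."""
    information_impact = max(confidentiality, integrity, availability)
    score = int(information_impact) * int(business_impact)  # range 0..9
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    if score >= 1:
        return "low"
    return "informational"

# Minor impact on a single workstation with little business relevance.
print(severity(Level.LOW, Level.LOW, Level.LOW, Level.LOW))      # low
# Full loss of availability on a business-critical system.
print(severity(Level.NONE, Level.NONE, Level.HIGH, Level.HIGH))  # high
```

Whatever scheme is used, keeping it explicit makes the initial score reproducible and easy to adjust once the investigation provides more facts.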
Mis-categorized or mis-prioritized incidents can cause harm to the business, so a calm attitude is essential during this step too. To stick to the previous example of network IDS alerts: if the IDS categorizes an event as high severity, it does not necessarily mean that this severity should be kept without adjustment. Another typically stressful moment for a SOC, or for any team that handles cyber security incidents, is when a C-level leader seems to be impacted by a potential incident. For example, degraded performance of the CEO's laptop should not necessarily trigger a high severity cyber security incident investigation. There are also significant differences between separate cases of the same incident category: an adware type of malware infection should not be treated with the same severity as a self-propagating information stealer or ransomware.
At this stage, we have reached the point where we know we have a cyber security incident and, based on its priority, how quickly we need to respond to it. From here on, our focus is to develop and execute a course of action that aims to minimize the impact of the incident on business operations.
The next two steps (analysis and containment) are usually done in parallel in multiple iterations.
Analysis aims to understand the details of the incident to be able to develop a response plan or to adjust an existing one.
As part of the analysis, we first need to gather all relevant information. For example, in case of a webserver breach, we need detailed application logs, OS level logs, network information from the surrounding network devices, etc. Even the collection itself might cause difficulties in some cases. If we need to examine a hard drive that is located at a remote site, then creating an image and booting it up in a virtual environment might take significant effort and time. Keeping a cool head is important during this stage too. Continuing the previous example, if we need to examine a potentially malware-infected device, shutting it down to create a forensic copy might irreversibly destroy crucial evidence.
Assuming that the necessary information is collected, the irrelevant part should be filtered out: we should be able to dig through a huge amount of data and keep only what is related to the incident. After this, we need to understand the root cause of the incident by reconstructing the causality chain step by step. Based on the root cause, an accurate incident response plan can be created that describes how the incident should be remediated.
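The filtering and ordering described above can be sketched as a simple timeline builder. The event structure, the sample messages and the indicator values are invented for illustration; real investigations rely on far richer data and tooling.

```python
from datetime import datetime

def build_timeline(events, indicators):
    """Keep events that mention any indicator and sort them chronologically."""
    related = [
        event for event in events
        if any(indicator in event["message"] for indicator in indicators)
    ]
    return sorted(related, key=lambda event: datetime.fromisoformat(event["timestamp"]))

# Hypothetical events collected from web, OS and firewall logs.
events = [
    {"timestamp": "2024-03-01T10:05:00", "source": "os",
     "message": "new process started by www-data: /tmp/x.sh"},
    {"timestamp": "2024-03-01T10:01:00", "source": "web",
     "message": "POST /upload.php from 203.0.113.7"},
    {"timestamp": "2024-03-01T10:02:30", "source": "web",
     "message": "GET /uploads/x.php from 203.0.113.7"},
    {"timestamp": "2024-03-01T09:00:00", "source": "os",
     "message": "cron job finished"},
    {"timestamp": "2024-03-01T10:06:00", "source": "fw",
     "message": "outbound connection to 203.0.113.7:4444"},
]

# Indicators assumed to be known from the initial alert.
indicators = ["203.0.113.7", "www-data"]

for event in build_timeline(events, indicators):
    print(event["timestamp"], event["source"], event["message"])
```

Read in order, the remaining four events suggest a causality chain from a file upload to code execution to an outbound connection, while the unrelated cron entry is dropped.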
Containment is about isolating the incident so that further business impact is prevented and time is gained for the analysis. Forensics and malware analysis take significant time, so a proper containment strategy is important; otherwise further damage would hit the business.
Containment is usually done in the form of isolating a service, an endpoint, or a network segment.
During this step we actively intervene in an environment, therefore remaining calm is critical again.
Isolation immediately reveals to the attacker that we know about them, which might prompt him or her to start covering tracks or to quickly cause as much harm as possible. So, for certain types of incidents we should be prepared to redirect the attacker to a non-production, honeypot type of system to delay this disclosure.
On the other hand, isolation might in some cases cause more trouble than the attack itself. For example, if a production server gets infected by malware that tries to harvest information and leak it to a command and control server, and we confirm that the command and control connection could not be established yet, we might not need to shut down the device immediately, as it might serve a critical business process. Following the same example, if malware has successfully infected a server but does not propagate, meaning it does not infect other systems automatically, network separation can interrupt connections, which again might impact critical business processes.
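The trade-off in these examples can be written down as a small decision function. This is a deliberately simplified sketch: the three input flags, their order of evaluation and the action labels are assumptions for illustration, and a real decision would weigh many more factors.

```python
from dataclasses import dataclass

@dataclass
class InfectedHost:
    c2_established: bool     # command and control connection confirmed
    self_propagating: bool   # malware infects other systems automatically
    business_critical: bool  # host serves a critical business process

def containment_action(host):
    """Return a hypothetical containment action label for an infected host."""
    if host.self_propagating:
        # Spreading malware threatens other assets, so isolation comes first.
        return "isolate_host"
    if host.c2_established:
        # Data may already be leaking to the attacker.
        return "isolate_host"
    if host.business_critical:
        # No spread and no C2 yet: avoid interrupting the business process.
        return "keep_running_under_close_monitoring"
    return "isolate_host"

critical_server = InfectedHost(c2_established=False, self_propagating=False,
                               business_critical=True)
print(containment_action(critical_server))  # keep_running_under_close_monitoring
```

Writing the criteria down in advance, in whatever form, means the decision during an incident is a lookup rather than an improvisation.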
Eradication & Remediation:
Now we have reached those steps of cyber security incident management which are usually completed by counterpart teams.
If we have a malware-infected device, then the final elimination of the incident, which is the complete rebuild of the asset, is not done by the incident management team but by the team in charge of daily IT management. However, during these stages incident management teams must closely monitor the status, as the situation might change, unforeseen issues might arise, or assistance might be required, for example on how to save data from a malware-infected device in a safe manner.
The incident management team remains responsible for coordinating the response actions, and even during these steps internal or external obligations, such as regulatory reporting, might apply. A common pitfall is the incident management team "micro-managing" those who work on eradication and remediation. Too frequent requests for status updates might seriously slow down the execution.
Reporting and Communication:
Besides the technically oriented steps that we covered so far, communication and reporting must also be managed from the point when we declare a cyber security incident until it is considered resolved.
Reporting and improvement actions, such as a lessons learnt review, are mostly considered to be the last lifecycle step of incident management. Their outcome provides crucial input for the preparation: how can we strengthen our protection, how can we do better on the process side, etc.
Communication ensures that all impacted parties (management, users, technical staff) are informed about the situation, potential impact on normal operations, or expected actions from their side.
A calm attitude during communication again plays a significant role in ensuring an efficient response, and of course it is crucial for preventing unintentional reputational loss.
The same applies to all forms and stages of reporting. Stating assumptions as facts, based on an incomplete investigation or unverified threat intelligence, leads to misunderstandings, faulty decisions or other harm.
How to remain calm?
So, after reviewing the general steps of incident management and a few examples of the damage that can be caused by hasty conduct, let us see what we can do to remain calm even during the most stressful moments, focusing on what can be done on the process and technology layers. The answer is obvious, we need to be prepared. The importance of the very first preparation step of the incident management lifecycle cannot be over-emphasized.
Sufficient oversight of the environment must be ensured by implementing proper collection and analysis of logs, network packets and endpoint information, including even the newest behavior analysis techniques. It is crucial to interpret all relevant events within the protected environment. If incident handlers must focus on more events than necessary, it will obviously waste resources. If they do not see the relevant ones, they might fail to detect an incident and therefore fail to fulfill their mission.
Besides that, it must also be ensured that the incident management team reaches accurate conclusions when analyzing the information related to the incident. Using an advanced analysis technology such as behavior analysis helps with that.
Consistent incident response plans must be prepared in alignment with the relevant risk portfolio. One way to optimize the response capability is to minimize the time spent on the activities (perform containment, analysis, and eradication and remediation as quickly as possible).
If response plans do not need to be defined in an ad hoc manner whenever an incident is declared, it certainly helps to remain calm. It is infeasible to prepare plans for every potential incident; however, plans must be predefined at least for the most relevant ones in each incident category. For example, let us have incident plans defined for malware incidents, covering cases that involve self-propagation or ransomware. Or let us have plans for cyber security incidents that originate from social engineering attacks as well as from hacking.
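Predefined plans per category can be pictured as a lookup table with a fallback. In the sketch below, the category and variant keys follow the examples in the text, while the plan steps themselves are illustrative placeholders and not recommended procedures.

```python
# Fallback for incidents without a dedicated plan (placeholder steps).
GENERIC_PLAYBOOK = ["triage", "analyze", "contain", "eradicate", "report"]

# Hypothetical plans keyed by (category, variant); variant None is the
# category-wide default.
PLAYBOOKS = {
    ("malware", "ransomware"): [
        "isolate affected hosts", "protect backups", "rebuild and restore",
    ],
    ("malware", "self_propagating"): [
        "isolate affected segment", "identify propagation method", "patch and rebuild",
    ],
    ("malware", None): ["scan host", "remove malware", "verify cleanup"],
    ("social", None): ["reset credentials", "warn targeted users", "block sender"],
    ("hacking", None): ["preserve evidence", "close entry point", "review access"],
}

def select_playbook(category, variant=None):
    """Pick the most specific predefined plan, falling back to broader ones."""
    for key in ((category, variant), (category, None)):
        if key in PLAYBOOKS:
            return PLAYBOOKS[key]
    return GENERIC_PLAYBOOK

print(select_playbook("malware", "ransomware"))
print(select_playbook("malware", "adware"))  # falls back to the malware default
print(select_playbook("physical"))           # falls back to the generic plan
```

The fallback order mirrors the idea that not every incident can have its own plan, but the most relevant ones per category should.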
SOC or Incident Management teams are under continuous pressure due to the increasing number of incoming alerts that must be processed, the shortage of skilled human resources, and the wide variety of tools that must be used during the incident management process. Using Security Orchestration, Automation and Response (SOAR) within incident management helps to ensure consistent, standardized execution of the processes, to minimize the response time spent on incidents, and to automate manual tasks so the SOC team can focus on other value-adding tasks.
Training is equally important during the preparation. Response and even Detection processes should be tested regularly involving generic table-top exercises and more specific simulations (like adversary simulation, phishing tests, etc.).
Performing maturity assessments for the organization’s incident handling capability is an important method too as it shows where the SOC or Incident Management team is on its journey.
Last but not least, the management aspects that ensure the enablers are available for the SOC or Incident Management team should not be forgotten. To mention just a few: the necessary authority should be defined for and granted to the team so that it can perform its tasks effectively, staffing has to be adequately planned and maintained, and budgeting should be tailored to the business objectives.