22 Tips to Speed Up Mean Time to Remediate (MTTR) in the Cloud

In the cloud, problems can crop up quickly, often out of your control. If your cloud environment has been compromised, you need to move fast. The longer a threat lingers and spreads, the greater the risk and the greater the damage. Reducing mean time to remediate/respond/recover (MTTR) isn’t just about speed; it’s about business resiliency, which means smarter automation, streamlined processes, and fewer inefficiencies. We asked 30 of your professional colleagues for their best and swiftest options, and their insights reveal what truly works for reducing MTTR. Here’s their advice.

Huge thanks to our sponsor, Palo Alto Networks

Cortex Cloud, the next generation of Prisma Cloud, merges best-in-class CDR with industry-leading CNAPP for real-time cloud security. Harness the power of AI and automation to prioritize risks with runtime context, enable remediation at scale, and stop attacks as they occur. Bring together your cloud and SOC on the unified Cortex platform to transform end-to-end operations. Experience the future of real-time cloud security at https://www.paloaltonetworks.com/cortex/cloud.

1. Start with the best ingredients – your data

You need top-quality data for top-quality results. “Transparency drives behavior,” said Grant Anthony, CISO, Orion Health. “Use rock-solid data and metrics to promote transparency and healthy competition among teams.”

2. Identify your targets and understand how they’re connected

Once you’re confident your data collection is sound, look for connections between your assets, crown jewels, processes, and identities.

“Establish a policy or guidance on the SLA (service level agreement) to determine the target MTTR, then work towards that target,” said Greg McCord, CISO, Lightcast. “Start with easy targets first, improve underlying processes, and move the target closer to requirements.”
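
To make McCord’s advice concrete, here is a minimal sketch of checking measured MTTR against an SLA target for each severity level. The severity names, the target values, and the `Incident` record are illustrative assumptions, not figures from any contributor; substitute the targets from your own policy.

```python
from dataclasses import dataclass
from datetime import timedelta
from statistics import mean

# Hypothetical targets; real values come from your own SLA policy.
TARGET_MTTR = {
    "critical": timedelta(hours=4),
    "high": timedelta(hours=24),
    "medium": timedelta(days=7),
}

@dataclass
class Incident:
    severity: str
    time_to_remediate: timedelta

def mttr_by_severity(incidents):
    """Return the mean time to remediate for each severity level present."""
    durations = {}
    for inc in incidents:
        seconds = inc.time_to_remediate.total_seconds()
        durations.setdefault(inc.severity, []).append(seconds)
    return {sev: timedelta(seconds=mean(vals)) for sev, vals in durations.items()}

def sla_report(incidents, targets=TARGET_MTTR):
    """Compare measured MTTR with the target for each severity."""
    report = {}
    for sev, measured in mttr_by_severity(incidents).items():
        target = targets[sev]
        report[sev] = {"mttr": measured, "target": target, "met": measured <= target}
    return report

# Example data (made up for illustration).
incidents = [
    Incident("critical", timedelta(hours=3)),
    Incident("critical", timedelta(hours=7)),
    Incident("high", timedelta(hours=10)),
]
report = sla_report(incidents)
```

Starting with loose targets and tightening the values in `TARGET_MTTR` as processes improve is one way to “move the target closer to requirements.”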

Before something bad happens, establish a clear order of priority. What needs to be dealt with first?

“Classify your cyber assets and tie your cyber assets to identities. We don’t need to focus on MTTR for everything. Identify business-critical assets and processes and prioritize protection/prevention/preparedness (P) and remediation/response/recovery (R) for those,” suggested Yabing Wang, vp, CISO and CIO, Justworks.

3. Keep things in context – see the forest as well as the trees

Context tells you how a finding or a fix actually plays out in your environment. It helps ensure that individual solutions do not end up obstructing overall progress.

“Tie the security efforts to the company’s business goals,” said Tomer Gershoni, former CSO, ZoomInfo. “Simplify the vulnerabilities information including the business impact and precise remediation guidelines and build an agreed KPI that could push towards better accountability of engineering leadership.”

“Investing in context building helps your team optimize each of these strategies to move faster. Developing application and asset context, architecture diagrams, threat models, and incorporating the related asset meta-data into your automated workflows allow your responders to remove manual steps and make critical decisions faster,” said Richard Marcus, vp, InfoSec, AuditBoard.

4. Keep your solutions simple, too

Everyone wants to simplify security. If you have more moving parts and more places to look, you’re just adding time to your MTTR.

“A centralized logging/monitoring solution across the cloud landscape will minimize the alert fatigue that is all too real in our industry. Instead of various tools sending emails, creating tickets, and sending alerts to an administrator console, have one location where all logging and monitoring results are funneled. From there, implement your playbooks,” suggested Nick Ryan, CISO, RSM LLP.

“Teams from development, cloud architecture, and security operations often operate independently, using disconnected tools, with different risk models. Fast (near real-time) remediation and response means stitching all relevant data together so that all teams are drawing from the same interconnected context,” added Elad Koren, vp, product management, Cortex Cloud (Palo Alto Networks).

5. Write, run, and revise playbooks

“When it comes to the data and application layers of a complex organization, there needs to be a playbook for what is brought back to service and in what order,” advised Jack Kufahl, CISO, Michigan Medicine.

“Automate playbooks (where possible), detail business continuity plans and incident response processes that are both heavily documented and constantly practiced,” said Shaun Marion, vp, CSO, Xcel Energy.

These playbooks need to be detailed without being confusing, and they must be something your team can actually run. Test them.

“Regularly drill these playbooks, lean on them during live incidents, and continually refine them,” added Marion.

“For runbooks, we had to develop quality, realistic scenarios and what the automated action would be based on the presented symptoms. These can then be given to new cloud security providers to have them implement them on your behalf with their product,” said RSM LLP’s Ryan.

Kufahl also suggested “improving the connection between operational units and the IT recovery teams to confirm what order to recover applications so there can be an effective return to business capabilities.”
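
One way to capture the “what order” part of a recovery playbook is to record which services depend on which, then derive the restore sequence from that map. The sketch below uses Python’s standard `graphlib` module; the service names and the dependency map are hypothetical, and a real map would come from the conversations between operational units and IT recovery teams that Kufahl describes.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up before it.
DEPENDS_ON = {
    "dns": [],
    "identity": ["dns"],
    "database": ["dns"],
    "billing-api": ["identity", "database"],
    "customer-portal": ["billing-api"],
}

def recovery_waves(depends_on):
    """Group services into waves; everything in one wave can be restored in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = recovery_waves(DEPENDS_ON)
for number, services in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(services)}")
```

A side benefit of this approach is that a circular dependency surfaces as an error while you are writing the playbook rather than during an incident.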

These playbooks can be highly targeted. “Implement verification workflows/runbooks designed to rapidly collect and analyze contextual information for quick false positive identification. This approach frontloads the investigation work that analysts typically perform manually,” said Mathew Biby, CISO, Agile Security Group LLC.

6. If you did your documentation right, no one should have to ask a question

If someone must ask a question about a process in the heat of an incident, you’re adding time to your MTTR.

“The person working the ticket should not have to ask a single question or look at more than one source to know what is correct and what the purpose of every service and config is doing,” said Howard Holton, COO, GigaOm.

7. Plan your practice and practice your plan

“Every portion of your incident response plan should be practiced until it is smooth and second nature to those involved,” said Adam Arellano, field CTO, Traceable.

Practice is not a one-time thing, added Xcel Energy’s Marion. “A quarterly tabletop exercise is akin to flossing your teeth only before going to the dentist; if you aren’t doing it regularly, it’s ineffective. When an incident occurs, you want your team responding with muscle memory, not making it up on the fly. Continuous practice ensures that your team is prepared and can react swiftly and effectively.”

Practice not only shows the “how” of recovery but also the “how long.” Most organizations have no idea how long it will take to restore data from a backup.

“Conducting regular test exercises is essential and ensures that your team not only has the skills and technology to recover effectively but also understands how long recovery will take,” offered Dennis Pickett, vp, CISO, Westat.

Practice is a safe place to make mistakes, and reviewing those mistakes is where the learning happens.

“Leverage root cause analysis reports and past incident data,” said Russ Ayres, deputy CISO, head of cyber, Equifax. “They can highlight gaps in process flows, technical barriers, skill set deficiencies, and even pinpoint teams or individuals who excel at resolving issues quickly.”

8. Automate what can be automated

“Too often teams focus on detection but still rely on manual intervention to fix issues, which slows everything down,” stated Marcos Marrero, CISO, H.I.G. Capital.

“Regimented, robust, and mature automations that handle standardized processes with little to no human interactions can dramatically reduce MTTR,” said Ken Athanasiou, CISO, VF Corporation.

“Automate anything you do for a second time,” advised Robb Reck, chief trust and security officer, Pax8.

“Use real-time event-driven architecture to automatically trigger remediation workflows within seconds of detection,” suggested Aamir Niazi, executive director/CISO, SMBC Capital Markets.

“Automation requires some up front investments in RPA (Robotic Process Automation) and SOAR (Security Orchestration, Automation, and Response) along with maintenance and tuning on a consistent on-going basis, and it can be very bumpy when you start out if your environment isn’t very mature, but the juice is definitely worth the squeeze,” said Athanasiou.

“Because of how fast cloud native environments change and how decentralized they are, automating as much as you can is the key to reducing MTTR, but this is only possible with a unified data model driving AI/analytics,” added Palo Alto Networks’ Koren.

“Not only does automation drive down MTTR, but it also increases employee satisfaction,” said Reck.

9. Contain the damage before it becomes a problem

Automation also adds a significant security benefit: it can contain a threat before it spreads.

“Automated workflow isolates the compromised VM before lateral movement happens,” explained SMBC Capital Markets’ Niazi.

“Automating common responses,” said H.I.G. Capital’s Marrero, “like revoking excessive IAM permissions or isolating compromised resources, can immediately contain threats, removing human bottlenecks and notifying security teams for review.”

“Use policy-as-code and SOAR for immediate containment,” said SMBC Capital Markets’ Niazi, adding, “If a public S3 bucket is detected, auto-restrict permissions and notify SecOps before an attacker exploits it.”
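
Niazi’s S3 example can be sketched as a small policy function plus a remediation step. This is a simplified, in-memory illustration under stated assumptions: the `Bucket` model, the `public-approved` exception tag, and the action names are all hypothetical, and in a real deployment the containment step would call your cloud provider’s API or your SOAR platform rather than flip a field on a Python object.

```python
from dataclasses import dataclass, field

@dataclass
class Bucket:
    name: str
    public_read: bool = False
    public_write: bool = False
    tags: dict = field(default_factory=dict)

def evaluate(bucket):
    """Policy: no bucket is public unless tagged as an approved exception."""
    if bucket.tags.get("public-approved") == "true":
        return []
    actions = []
    if bucket.public_read or bucket.public_write:
        actions.append(("block_public_access", bucket.name))
        actions.append(("notify_secops", bucket.name))
    return actions

def remediate(bucket, notify):
    """Contain first, then notify SecOps so a human can review the change."""
    actions = evaluate(bucket)
    for action, name in actions:
        if action == "block_public_access":
            bucket.public_read = False
            bucket.public_write = False
        elif action == "notify_secops":
            notify(f"Public access removed from bucket {name}; please review.")
    return actions

# Example run with made-up buckets; notifications are collected in a list.
notifications = []
exposed = Bucket("invoices", public_read=True)
approved = Bucket("press-kit", public_read=True, tags={"public-approved": "true"})
exposed_actions = remediate(exposed, notifications.append)
approved_actions = remediate(approved, notifications.append)
```

Keeping the policy (`evaluate`) separate from the enforcement (`remediate`) lets you run the same rule in audit-only mode before you trust it to act on its own.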

10. Tie automation and threat intelligence to your cloud response plan

Automation then leads to increased and improved threat intelligence.

“Integrate cloud-native threat intelligence feeds to detect and auto-prioritize high-risk threats in place of SIEM-based alerts, which lag behind real-time threats,” said SMBC Capital Markets’ Niazi.

“Cross-correlate those threat feeds with your own telemetry and use snapshot and forensic cloning for fast incident response,” Niazi added, pointing out that “one of the biggest timewasters in cloud IR is losing volatile evidence before an investigation starts. Automate forensic snapshotting to capture a system’s exact state before remediation kicks in.”
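
The ordering Niazi describes, capturing state before remediation changes it, can be enforced in the workflow itself. Below is a minimal sketch in which `take_snapshot` and `isolate` are placeholder callables you would wire to your own cloud tooling; neither is a real library function. The choice to continue with isolation when the snapshot fails is an assumption of this sketch, on the reasoning that containment should not wait on evidence capture.

```python
def respond(resource_id, take_snapshot, isolate):
    """Capture evidence first, then contain. Returns a timeline of what happened."""
    timeline = []
    try:
        snapshot_id = take_snapshot(resource_id)
        timeline.append(("snapshot", snapshot_id))
    except Exception as exc:
        # Record the failure for investigators, but do not block containment.
        timeline.append(("snapshot_failed", str(exc)))
    isolate(resource_id)
    timeline.append(("isolate", resource_id))
    return timeline

# Example run with stand-in functions that only record the call order.
calls = []

def fake_snapshot(resource_id):
    calls.append("snapshot")
    return "snap-001"

def fake_isolate(resource_id):
    calls.append("isolate")

def failing_snapshot(resource_id):
    raise RuntimeError("quota exceeded")

timeline = respond("vm-42", fake_snapshot, fake_isolate)
failed_timeline = respond("vm-42", failing_snapshot, fake_isolate)
```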

11. Keep your experts on speed dial

One key reason to hold tabletop exercises is to determine who to call at each point in a given scenario.

“Ensure you have a readily available and up-to-date list of the right subject matter experts for each scenario,” said Equifax’s Ayres. “The person responsible for an application or infrastructure component may not always be the most qualified to restore it quickly and safely.”

12. Test so you’re never surprised

Testing is more than just practicing. It confirms that an action, like restoring from a backup, does what you expect it to do, and it tells you how long that action will take.

“Having an effective and regularly tested recovery/restoration program is the single best way to ensure low MTTR metrics,” offered Edwin Covert, head of cyber risk engineering, Bowhead Specialty.

“Test your ability to ‘recover completely from nothing’ regularly,” added Jim Bowie, CISO, Tampa General Hospital. “The number of applications you think you have a backup, route, or high-availability (HA) connection to correct but don’t will surprise you.”

13. Pursue a gold standard in standardizing

“My favorite tip to reduce MTTR is to standardize one’s operations,” said David Emerson, CIO, SolCyber.

“Understanding your attack surface is essential, whether you’re operating in single-cloud, multi-cloud, hybrid, or transitional on-prem-to-cloud environments. Choosing the right Cloud Security Posture Management (CSPM) solution streamlines visibility and accelerates your ability to detect, respond, and remediate threats effectively,” suggested Ty Sbano, CISO, Vercel.

The purpose of standardization is so that everyone knows what to expect, said SolCyber’s Emerson: “When a heavily automated and orchestrated environment encounters an incident, it is easier to troubleshoot, faster to ascertain the scope, faster to recover from, and easier to test response, remediation, and recovery. Reducing the influences of uncertainty and inconsistency will drastically and naturally reduce your MTTR.”

14. Once standardized, it’s time to refine

Once you have your systems in place, don’t count on anything working out of the box. You need to refine your environment.

“By closely monitoring program performance and fine-tuning our detection mechanisms, we ensure that every alert warrants immediate attention, optimizing our team’s efforts and resources,” added Sivan Tehila, CEO and founder, Onyxia Cyber.

15. You’ll speed up response time if you understand the true cause

“MTTR is not just about reacting quickly but about getting to the true root of the issue,” said Kayla Underkoffler, lead security engineer, office of the CTO, Zenity.

Xcel Energy’s Marion agreed: “The ‘boil-the-ocean’ strategy typified by CASB (Cloud Access Security Broker) diluted our ability to effectively monitor and respond to what matters. Focus your monitoring, controls, and capabilities around your core cloud services that house sensitive information or are critical to your core business operations.”

Underkoffler saw this pattern with other organizations as well: “Security teams rush to analyze inputs and outputs to try and identify possible issues. In doing so, they miss the bigger picture of what happens underneath the hood…ad hoc approaches may shorten time to response in the short-term but will likely result in alert overloads that can slow down teams in the long run.”

16. To dramatically improve MTTR, you’ll need to overhaul development

“Expecting a response time measured in hours in the face of a crisis is naïve if your typical change rate is measured in days or weeks,” said Russell Spitler, co-founder, CEO, Nudge Security.

Your bottleneck may be the speed at which your development team can make updates.

“When looking to improve MTTR, the focus should not be on security processes but development processes, as in, ‘what can you do to help automate or assist the development process to get to the point where daily or hourly updates to the environment are feasible,’” asked Spitler.

17. Automation is wonderful, but people drive the effort to reduce MTTR

Successfully reducing MTTR is very people-intensive.

“Collaborate with your partners in the organization to share and discuss targets. The relationship aspect is incredibly important to help deliver your target requirements,” said Lightcast’s McCord.

“Simply knowing which assets and processes are critical isn’t enough. By mapping who has access to them, we gain insight into potential broader impacts (as the same identity may have access to multiple critical assets and processes). This also helps us determine who can remediate and how,” added Justworks’ Wang.
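
Wang’s point about mapping identities to critical assets lends itself to a simple lookup in both directions. The sketch below assumes a hypothetical access map and a hypothetical set of critical assets; in practice you would populate both from your IAM and asset inventory data.

```python
# Hypothetical access map: identity -> assets it can reach.
ACCESS = {
    "svc-deploy": {"payments-db", "customer-portal", "build-cache"},
    "alice": {"payments-db"},
    "bob": {"build-cache"},
}
# Hypothetical list of business-critical assets.
CRITICAL = {"payments-db", "customer-portal"}

def blast_radius(identity, access=ACCESS, critical=CRITICAL):
    """Critical assets reachable if this identity is compromised."""
    return access.get(identity, set()) & critical

def who_can_reach(asset, access=ACCESS):
    """Identities with access to an asset: candidates to review, or to help remediate."""
    return sorted(name for name, assets in access.items() if asset in assets)

def rank_identities(access=ACCESS, critical=CRITICAL):
    """Identities ordered by how many critical assets they touch, highest first."""
    scored = [(len(assets & critical), name) for name, assets in access.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ordered if score > 0]

ranking = rank_identities()
```

Here `rank_identities()` puts `svc-deploy` first because it reaches both critical assets, which is the “same identity, multiple critical assets” situation Wang describes.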

18. Often, the problem with slow MTTR is taking too long to detect

“What’s not often considered is that many security alerts sit around for hours in a queue before someone looks at them. If you want to reduce your MTTR, you should start by addressing MTTA, or mean time to acknowledge,” stated Edward Wu, CEO and founder, Dropzone AI.
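
A quick calculation shows how much of MTTR can be queue time. The sketch below assumes each alert record carries three timestamps (created, acknowledged, remediated) and that both metrics are measured from alert creation; the field names and sample data are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Alert:
    created: datetime
    acknowledged: datetime
    remediated: datetime

def mean_delta(deltas):
    """Average a collection of timedelta values."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)

def mtta(alerts):
    """Mean time to acknowledge: creation until someone picks the alert up."""
    return mean_delta(a.acknowledged - a.created for a in alerts)

def mttr(alerts):
    """Mean time to remediate, measured here from alert creation."""
    return mean_delta(a.remediated - a.created for a in alerts)

# Example data (made up for illustration).
t0 = datetime(2025, 1, 6, 9, 0)
alerts = [
    Alert(t0, t0 + timedelta(hours=3), t0 + timedelta(hours=4)),
    Alert(t0, t0 + timedelta(hours=1), t0 + timedelta(hours=2)),
]
queue_share = mtta(alerts) / mttr(alerts)  # fraction of MTTR spent waiting
```

With this sample data, MTTA is two hours and MTTR is three hours, so two-thirds of the elapsed time passes before anyone looks at the alert.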

Davi Ottenheimer, vp, digital trust and ethics, Inrupt, is a fan of goose tokens “to alert you immediately when accessed, giving you early warning of potential intrusions or misconfigurations that might expose sensitive resources.”

Some people call these canary tokens, but he pointed out that it’s the goose that sounds the alarm, rather than dying quietly. “Put your ‘honking’ tokens all over your cloud,” he added.

Then, start connecting your services so that these alerts can be seen where they’re needed most.

“Correlating identity provider (IDP) logs with logs from critical SaaS applications and cloud services significantly accelerates detection and response capabilities, allowing Tier 2/3 analysts or threat hunters to immediately begin investigations with the insights needed,” added Adam Koblentz, field CTO, RevealSecurity.
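
As a rough illustration of the correlation Koblentz describes, the sketch below pairs each flagged sign-in from an identity provider with the same user’s application activity in the following 30 minutes. The event fields (`user`, `time`, `suspicious`, `action`) and the 30-minute window are assumptions made for this example; real IdP and SaaS logs will need normalizing into a common shape first.

```python
from datetime import datetime, timedelta

def correlate(idp_events, app_events, window=timedelta(minutes=30)):
    """Pair each suspicious sign-in with the same user's app activity shortly after."""
    findings = []
    for login in idp_events:
        if not login["suspicious"]:
            continue
        related = [
            event for event in app_events
            if event["user"] == login["user"]
            and login["time"] <= event["time"] <= login["time"] + window
        ]
        if related:
            findings.append({"user": login["user"], "login": login, "activity": related})
    return findings

# Example data (made up for illustration).
t = datetime(2025, 1, 6, 9, 0)
idp_events = [
    {"user": "alice", "time": t, "suspicious": True},
    {"user": "bob", "time": t, "suspicious": False},
]
app_events = [
    {"user": "alice", "time": t + timedelta(minutes=5), "action": "export_all_contacts"},
    {"user": "alice", "time": t + timedelta(hours=2), "action": "view_dashboard"},
    {"user": "bob", "time": t + timedelta(minutes=1), "action": "export_all_contacts"},
]
findings = correlate(idp_events, app_events)
```

Each finding hands the analyst the sign-in and the follow-on activity together, so the investigation can start from a joined view instead of two separate consoles.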

19. Renew rather than patch in perpetuity

“Rip and replace” or any version of it may make you much happier in the long run.

“Sometimes you have to give up rather than struggling to keep a plant alive forever,” said Inrupt’s Ottenheimer. “Don’t patch servers through the winter when you can replace them fresh in the spring.”

20. It’s always DNS, isn’t it?

“As every system and network engineer knows, ‘It’s always DNS,’” said Bozidar Spirovski, CISO, Blue Dot. “DNS may be a part of the cloud platform and seems like it’s a transparent thing. But whether it’s about responding to a security incident or running disaster recovery, having a good control and understanding of how everything within your cloud environment uses and resolves DNS is crucial. Establish DNS control exercises to get a deep understanding of your DNS resolution, forwarders, servers, and namespace used by the services as well as a clear map of who controls the DNS servers and can work with you to reconfigure.”

21. Give IaC some love

It’s nice to know you can always go back home.

“Having your cloud infrastructure as code (IaC) backed up in a text file allows you to restore your architecture to a previously known good state with just a few clicks. This method can significantly reduce recovery time. Decouple your data from your systems and ensure it is backed up separately. One of the key benefits of cloud environments is flexibility, and leveraging IaC is an excellent way to enhance that advantage,” suggested Westat’s Pickett.

“IaC means rapid fixes can operate across environments simultaneously. Make some IaC templates to reduce or prevent misconfigurations,” added Inrupt’s Ottenheimer.

22. Make sure your cloud service is set for success

The most obvious starting point is to work out with your cloud service provider what you can do to get back to business as quickly as possible.

“Build gold images that are hardened and patched before any deployment and then create deployable AMIs (Amazon Machine Images). Position all systems, including single servers, behind ALBs (Application Load Balancers) with autoscaling policies and security groups,” said Jesse Webb, CISO and svp, information systems, Avalon Healthcare Solutions.

Conclusion: Don’t wait for the sky to fall

These tips for improving MTTR revolve around awareness and preparation. Some are people-focused and others are technology-focused. The greatest challenge for CISOs, though, comes from convincing others of the need to prepare in advance. Humans don’t like to think about all the bad things that can happen and will prefer to look to the CISO as the agent of response rather than anticipation. But the ability to communicate preparedness is not a Chicken Little scenario. As Tomer Gershoni stated, “MTTR can have an impact on the company’s performance, customer satisfaction, NPS (Net Promoter Score) and more dimensions which are way beyond just security.” It needs its place at the table now.

Steve Prentice
Author, speaker, expert in the area where people and technology crash into each other, viewed from the organizational psychology perspective. Host of many podcasts, voice actor and narrator for corporate media and audiobooks. Ghost-writer for busy executives.