Six common causes of major software outages

Thu, 3rd Oct 2024

By Andrew Foot, Regional VP Sales for Australia and New Zealand, Dynatrace

Major software outages - such as the recent incident that affected millions of Windows-based devices around the world - are an ongoing threat in an increasingly digital world.

Outages can disrupt services, cause financial losses, and damage brand reputations. Understanding the causes of these outages is crucial for preventing them and ensuring smoother, more reliable operations.

Outages can occur for many reasons, ranging from internal mishaps to external attacks. They may stem from software bugs, cyberattacks, surges in demand, issues with backup processes, network problems, or human errors. Each of these factors can independently cause a major disruption, but often, outages result from a combination of issues.

Six of the most common causes of major outages (and what organisations can do to avoid them) are:

1. Software bugs:
Software bugs and bad code releases are common causes of major outages. These issues can arise from errors in the code, insufficient testing, or unforeseen interactions among software components.

Unfortunately, the complexity of modern software systems exacerbates the risk of outages. As applications become more interconnected, the potential for failures increases. A seemingly minor bug in one component can have far-reaching consequences, potentially bringing down entire systems or services.

To prevent outages caused by software bugs, organisations should implement thorough testing procedures, including automated testing and continuous integration practices.

2. Cyberattacks:
Cyberattacks involve malicious activities aimed at disrupting services, stealing data, or causing damage. These attacks can be orchestrated by hackers, cybercriminals, or even state actors.

The cyberthreat landscape is constantly evolving, with attackers developing increasingly sophisticated methods to exploit vulnerabilities. Ransomware and remote code execution (RCE) attacks are examples in which malicious actors exploit vulnerabilities in systems.

Additionally, Distributed Denial of Service (DDoS) attacks, while not exploiting vulnerabilities directly, are malicious cyberattacks that can be highly disruptive to organisations.

To reduce the risk of cyberattacks, companies should combine proactive preventive measures, such as runtime vulnerability analytics, with comprehensive application and perimeter protection through firewalls, intrusion detection systems, and regular security audits. Training employees in cybersecurity best practices and keeping software and systems up to date are also crucial.

3. Spikes in demand:
Sudden spikes in demand can overwhelm systems that are not designed to handle such loads, leading to outages. This often occurs during major events, promotions, or unexpected surges in usage.

Real-world examples of demand-related outages are common and often high-profile. For instance, retail websites frequently crash during major sale events like Black Friday or Cyber Monday when a surge in traffic overwhelms their servers. Similarly, online streaming services have experienced downtime during the premieres of highly anticipated shows, as millions of eager viewers attempt to access the content simultaneously.

To manage high demand, companies should invest in scalable infrastructure, load-balancing, and load-scaling technologies. Conducting performance testing and having contingency plans for peak times can help ensure systems remain operational during spikes in usage.

4. Backup failures:
Failures in the backup process can lead to outages, especially when primary systems fail and backups do not activate as expected. This can result from improperly configured backups, corrupted data, or insufficient testing.

The impact of backup failures can be particularly devastating as they often come to light during already critical situations. For instance, a healthcare provider might lose access to patient records during a primary system failure, only to find that their backup data is incomplete or corrupted.

It's critical to regularly perform backup and recovery tests to ensure that systems are properly configured. Companies should have several recovery options in place, including snapshots, replication, and backups, so they can meet a range of recovery time objective (RTO) and recovery point objective (RPO) targets. A comprehensive disaster recovery (DR) plan with consistent testing is also essential to ensure that large recoveries work as expected.
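
As a rough illustration of what an automated recovery test might look like, the Python sketch below restores a backup file into a scratch directory and compares its checksum with the original. The function names (`sha256_of`, `verify_restore`) are hypothetical, and the file copy is only a stand-in for whatever restore procedure an organisation actually uses.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source: Path, backup: Path) -> bool:
    """Restore the backup into a scratch directory and compare checksums.

    Assumption: copying the file stands in for a real restore step.
    """
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / source.name
        shutil.copy2(backup, restored)
        return sha256_of(restored) == sha256_of(source)
```

Run on a schedule, a check along these lines could surface an incomplete or corrupted backup before a primary system failure does.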

5. Network issues:
Network issues encompass problems with internet service providers, routers, or other networking equipment. These can be caused by hardware failures, configuration errors, or external factors such as cable cuts.

The impact of network issues can range from minor inconveniences to severe operational disruptions. Slow internet speeds may hamper productivity, while complete outages can halt business operations entirely.

To mitigate network issues, organisations should ensure robust network monitoring and management practices. Redundant network paths and automated failover systems can help maintain connectivity during disruptions.
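
In its simplest form, automated failover means probing redundant paths in priority order and using the first one that responds. The Python sketch below shows that idea only. The `first_healthy` function and the `is_healthy` probe are hypothetical names, and a production setup would normally rely on routing protocols or a load balancer rather than application code.

```python
from typing import Callable, Optional, Sequence


def first_healthy(
    paths: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first path whose health check passes, or None if all fail."""
    for path in paths:
        try:
            if is_healthy(path):
                return path
        except OSError:
            # Treat a failed probe (timeout, refused connection) as unhealthy.
            continue
    return None
```

The caller supplies the health check, for example a function that attempts a short TCP connection, so the same selection logic can be exercised in tests without a real network.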

6. Human error:
Human error remains one of the leading causes of tech outages and can include mistakes made during routine maintenance, misconfigurations, or accidental deletions.

In high-pressure environments, even experienced professionals can make errors, especially when dealing with complex systems or tight deadlines.

Comprehensive training programs and strict change management protocols can help reduce human errors. Automated systems for routine tasks and thorough review processes for critical actions can also minimise the risk of mistakes.

Focus on causes and mitigation
Being aware of the main causes of large-scale tech outages is a good first step, but it must be followed by the development of strategies to prevent or mitigate them.

Organisations also need to put in place an observability platform that can provide a complete end-to-end view of all applications and services.

Putting these elements together can make an organisation's IT infrastructure more robust and reduce the likelihood that large-scale disruptions will occur.
