What do fat fingers, power failures, and Godzilla have in common? Answer: They have all been responsible for IT outages, and outages, Godzilla or not, can blow monster-sized holes in a company's budget. How monster-sized?
According to a Rand Organization report, 98% of organizations say that IT outages cost them a whopping $100,000 an hour. And those outages are practically guaranteed, according to a University of Chicago study.
It’s not just about direct losses due to outages, either. If a company can’t maintain its services, it may lose customers, and its reputation in the industry will suffer. Outages are often a hot topic on social media, and those messages tend to hang around for a long time. With numbers like those, companies of course prepare as well as they can for the inevitable.
Companies therefore invest heavily in redundancy and in High Availability (HA) technologies. They put continuous availability and resiliency architectures and systems in place so that, if an outage does occur, services can be restored as quickly as possible and uptime stays as close to 100% as possible. Despite those investments, we still see massive outages that affect millions of people and last for hours at a time. One would think that at least one of these heavily funded plans would do the trick: that the resilience plan would let the organization resolve the issue quickly, or failing that, that the redundancy plan would get it back online almost immediately.
Yet that is not the case; despite their apparent readiness, companies suffer ongoing, repeated outages. Clearly, the plans they have implemented are not up to the IT challenges that cause outages in the first place. Here are some of the reasons why:
Complexity – Hybrid and multi-cloud environments are the new normal. These multi-layered, dynamic environments span local and remote systems, and the dependencies among them are complex and hard to trace. With so many possible points of failure, discovering the glitch that could cause an outage is extremely difficult.
Ongoing changes – New features, capabilities, and services are being introduced at an unprecedented pace. Companies need to integrate them into their existing IT infrastructure, and every such change affects the environment around it. It's almost impossible to know in advance what the impact of these additions will be, which adds to the likelihood of outages.
Knowledge gap – Those new features and services bring with them a raft of new best practices that have to be followed in order to integrate them with existing IT systems. That's on top of the best practices and rules those systems are already subject to, which themselves keep changing as new versions come out. With every addition, the odds grow that a best practice for one system will be a non-starter for another.
Insufficient controls – IT systems are constantly being updated and new systems are introduced regularly, making it difficult to keep up with the changes and implement best practices properly. Ideally, every change would be tested before it reaches a working system, but the pace of work doesn't always give organizations the time they need for that. As a result, changes are implemented without sufficient controls, increasing the risk that something will go wrong.
In addition, vendors often update systems automatically, changing configuration files, dependencies, and more. IT teams frequently don't even learn about those changes until they have already landed, and while a change may improve the performance of an individual system, it can trigger problems, and outages, across the environment as a whole.
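As a minimal sketch of how a team might at least notice such silent changes, the Python snippet below hashes a set of configuration files and flags any that differ from the last recorded snapshot. The file paths and baseline location are hypothetical placeholders, not part of any particular product, and a real implementation would cover far more than two files.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical paths: point these at the configuration files you care about
# and at a writable location for the baseline snapshot.
CONFIG_FILES = [Path("/etc/myapp/app.conf"), Path("/etc/myapp/db.conf")]
BASELINE_FILE = Path("/var/lib/config-watch/baseline.json")


def file_hash(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_baseline() -> dict:
    """Load previously recorded hashes, or start fresh if none exist."""
    if BASELINE_FILE.exists():
        return json.loads(BASELINE_FILE.read_text())
    return {}


def check_for_silent_changes() -> list[str]:
    """Compare current hashes against the baseline and report drifted files."""
    baseline = load_baseline()
    drifted = []
    current = {}
    for path in CONFIG_FILES:
        digest = file_hash(path)
        current[str(path)] = digest
        # A file is "drifted" if we have seen it before and its hash changed.
        if baseline.get(str(path)) not in (None, digest):
            drifted.append(str(path))
    # Persist the new snapshot so the next run compares against it.
    BASELINE_FILE.parent.mkdir(parents=True, exist_ok=True)
    BASELINE_FILE.write_text(json.dumps(current, indent=2))
    return drifted


if __name__ == "__main__":
    changed = check_for_silent_changes()
    if changed:
        print("Configuration changed outside the normal process:", ", ".join(changed))
```

Run on a schedule, a check like this ensures that a vendor-pushed configuration change surfaces as an alert rather than as a surprise during the next outage.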
All of these factors contribute to the likelihood of outages and make it difficult for IT teams even to build effective resiliency plans. And even when redundancy plans are in place, they don't address the underlying cause of an outage, which, if left unresolved, is likely to trigger the same outage again. Given the formidable challenges IT teams face, that likelihood is very high.
Is there any way of reducing that likelihood? The first thing IT teams need to do is accept that manually keeping track of every configuration change and point of potential failure is impossible: there is too much data to track, too many data points to manage, and all of it keeps changing.
A better approach is to continuously inspect those IT configurations, settings, and dependencies and verify that they are set up according to vendor resilience best practices. A change in the configuration map can indicate that a problem is in the offing, so an alert issued in the wake of such a change gives IT teams a chance to intervene before it becomes a major problem.
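To make that concrete, here is a rough illustration in Python of rule-based configuration checking. The settings and rules are invented for the example rather than drawn from any vendor's guidance; in practice the rules would encode vendor resilience best practices and the configuration would be collected from live systems.

```python
# A rough sketch of rule-based configuration auditing. The configuration
# values and rules below are invented for illustration only.

# Current configuration as it might be collected from a system.
current_config = {
    "replication_enabled": False,
    "backup_interval_hours": 48,
    "connection_timeout_secs": 5,
}

# Each rule: setting name, a predicate the value should satisfy, and a message.
RULES = [
    ("replication_enabled", lambda v: v is True,
     "replication should be enabled for failover"),
    ("backup_interval_hours", lambda v: v <= 24,
     "backups should run at least daily"),
    ("connection_timeout_secs", lambda v: v >= 10,
     "timeout is too aggressive and may drop healthy nodes"),
]


def audit(config: dict) -> list[str]:
    """Return a human-readable finding for every rule the config violates."""
    findings = []
    for key, is_ok, message in RULES:
        value = config.get(key)
        if value is None or not is_ok(value):
            findings.append(f"{key}={value!r}: {message}")
    return findings


if __name__ == "__main__":
    for finding in audit(current_config):
        print("ALERT:", finding)  # In practice this would feed a ticket or pager.
```

The point is not these specific rules but the pattern: encode the best practices as checks, run them whenever the configuration map changes, and turn violations into alerts that reach the team before the outage does.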
With the complicated IT systems companies run today, the risk of outages is high and the costs can be huge. Companies need to do everything they can to avoid outages, and to make sure they never have to pay those costs.