If you need to understand data centre risk management quickly, look no further. We have distilled ten things you really need to know, from why it matters, to where risks arise, to common but mistaken mitigation strategies.
Risk management is not an optional extra in data centres. It is crucial. The sole purpose of a data centre is to keep information secure and safe, and to allow continuous access to it. The consequences of getting it wrong are severe: data loss can have huge financial consequences for businesses, resulting in hours of wasted time, business continuity issues, and even potential legal action from customers who fear their data may have been compromised, not to mention serious reputational damage.
Risk arises from several possible sources, not all of them obvious. These include the requirement for a stable power supply, with back-up power available in the event of any problems; the sensitivity of IT equipment; a multi-disciplinary environment, with all the potential for communication and interoperability problems that implies; a 24/7/365 requirement; and very high expectations. Perhaps more than anything else, though, risk arises from gaps in understanding and from misalignment between data centre strategy and business strategy.
A regular and complete testing and maintenance schedule can save a lot of pain later. Your maintenance schedule needs to include not just your UPS systems, but also your back-up generator(s), fire suppression systems, servers and storage. Most organisations know to test their UPS, but data from the Uptime Institute show that far fewer (less than half) test and maintain their storage systems.
Human error is the greatest cause of data centre failures, but training can help to mitigate the risk. Data from the Uptime Institute suggest that almost three quarters (73%) of data centre failure incidents are caused by human error, meaning that they are potentially avoidable. There is a 50% chance of error in tasks performed under pressure by staff who are unfamiliar with them. A familiar task carried out by a well-motivated and well-trained team has only a 0.4% chance of error. This means that the right people, with the right training and motivation, are vital.
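To see why these per-task figures matter, it can help to compound them over many tasks. The sketch below assumes, purely for illustration, that every task fails independently and with the same probability. Real operations are rarely that tidy, and the function name is hypothetical.

```python
def chance_of_at_least_one_error(per_task_error_rate: float, tasks: int) -> float:
    """Probability that at least one of `tasks` tasks goes wrong.

    Illustrative assumption: tasks are independent and share the same
    per-task error rate, so P(at least one error) = 1 - (1 - p) ** n.
    """
    if not 0.0 <= per_task_error_rate <= 1.0:
        raise ValueError("per_task_error_rate must be between 0 and 1")
    if tasks < 0:
        raise ValueError("tasks must not be negative")
    return 1.0 - (1.0 - per_task_error_rate) ** tasks


if __name__ == "__main__":
    # The two per-task error rates quoted in the text above.
    scenarios = {
        "unfamiliar task, under pressure": 0.50,
        "familiar task, trained team": 0.004,
    }
    for label, rate in scenarios.items():
        for tasks in (1, 10, 100):
            chance = chance_of_at_least_one_error(rate, tasks)
            print(f"{label}: {tasks} task(s) -> {chance:.1%} chance of an error")
```

Under these assumptions, even the 0.4% rate compounds to roughly a one-in-three chance of at least one error across 100 tasks, which is why training and motivation need to be sustained rather than treated as a one-off.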
You cannot provide a 100% SLA if you do not have the infrastructure to back it up. Yes, your customers may want that, but if you don’t have the infrastructure it is a) dishonest and b) foolhardy to agree to it. You will be caught out later by outages, loss of customer trust, and reputational damage from which you may struggle to recover. Expectation management and honesty are key to risk management in data centres.
The cloud is just ‘someone else’s data centre’. Moving data to the cloud does not help you to manage risk. All it does is move data to another data centre, about which you potentially know nothing, increasing the risk of problems, including serious reputational damage if anything goes wrong.
Blinding with science: the application of numbers. Numbers are often used to give a spurious accuracy to models, assurances, and risk management. But they actually guarantee nothing. A 99.9999% uptime guarantee still allows about 32 seconds of outage every year, which could be crucial to a business that can’t afford that. Power usage effectiveness (PUE), the ratio of total facility power to IT equipment power, looks like a clean measure of efficiency, but you can lower it simply by increasing IT power consumption. Beware of numbers and check their meaning.
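Both figures are easy to check by hand. The sketch below converts an uptime percentage into the downtime it still permits per year, and computes PUE as total facility power divided by IT equipment power. The function names and the example loads are hypothetical, and the PUE example assumes the non-IT overhead (cooling, lighting, power losses) stays fixed while IT load rises, which is a simplification.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def allowed_downtime_seconds(uptime_percent: float) -> float:
    """Seconds of downtime per year that an uptime percentage still permits."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return SECONDS_PER_YEAR * (1.0 - uptime_percent / 100.0)


def pue(it_power_kw: float, overhead_power_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_power_kw <= 0:
        raise ValueError("it_power_kw must be positive")
    return (it_power_kw + overhead_power_kw) / it_power_kw


if __name__ == "__main__":
    # "Six nines" still leaves roughly half a minute of outage per year.
    print(f"99.9999% uptime allows {allowed_downtime_seconds(99.9999):.1f} s/year")

    # Hypothetical site: 500 kW of overhead, assumed fixed for illustration.
    print(f"PUE at 1000 kW IT load: {pue(1000, 500):.2f}")
    # Burning more IT power makes the ratio look better, not the site greener.
    print(f"PUE at 1500 kW IT load: {pue(1500, 500):.2f}")
```

The second pair of figures shows the trap: total consumption went up by 500 kW, yet the headline PUE number improved.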
Green is good, but it’s not the main function of a data centre. The purpose of a data centre is to provide reliable IT services; being green is an added extra. Striving to be green can introduce new risks by reducing resilience. On the other hand, not being ‘green’ can make data centres the target of environmental campaigns and cause reputational damage. There is a fine line to walk.
Critical site management requires a focus on the high-impact risks. Like any other form of risk management, data centre risk management is best focused on the most likely and/or most catastrophic risks. Both the financial and the reputational impact should be considered. This focus on the critical issues means that attention is not diluted by less important areas.
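One common way to put this into practice is a simple risk register that scores each risk by likelihood and impact, then ranks them. The sketch below is one possible scheme, not a standard: the 1 to 5 scales, the choice to take the worse of the financial and reputational impact, and every entry in the example register are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int    # 1 (rare) to 5 (almost certain); hypothetical scale
    financial: int     # 1 (minor) to 5 (catastrophic); hypothetical scale
    reputational: int  # 1 (minor) to 5 (catastrophic); hypothetical scale

    @property
    def score(self) -> int:
        # Assumption: the worse of the two impacts drives the score, so a
        # risk that is severe on either dimension is not averaged away.
        return self.likelihood * max(self.financial, self.reputational)


def critical_risks(risks: list[Risk], top_n: int = 3) -> list[Risk]:
    """Return the `top_n` highest-scoring risks, highest first."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)[:top_n]


if __name__ == "__main__":
    # Made-up register entries, for illustration only.
    register = [
        Risk("Generator fails to start during mains outage", 2, 5, 5),
        Risk("Procedure error during UPS maintenance", 4, 4, 3),
        Risk("Car park lighting failure", 3, 1, 1),
    ]
    for risk in critical_risks(register, top_n=2):
        print(f"{risk.score:>2}  {risk.name}")
```

Whatever scoring scheme is used, the point is the cut-off: only the top of the list gets sustained attention, so effort is not spread thinly across minor items.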
Good risk management means pre-emptive action and highlighting systemic issues, with a focus on learning and improving. Events and incidents need to be analysed for their root causes, and every opportunity used for learning and improvement, without a focus on blame. Good risk management also focuses on preventive action, pre-empting problems through constant checks, regular assessment of infrastructure, and practice of routine and emergency procedures.