This week we have seen a textbook example of a business continuity issue making the mainstream news.
NHS Greater Glasgow and Clyde had problems with its server which caused more than 700 patients’ appointments to be cancelled.
With staff unable to access records and scans, treatments such as chemotherapy were called off.
Staff had to manually enter the records written during the days of the outage and did their best to rearrange appointments.
Robert Calderwood, the Chief Executive of the health board, confirmed on Thursday that the computer systems were back up and running after two days of trying to fix the problem.
As with all incidents there are a number of lessons we can learn.
1. As business continuity people, we often look at the loss of IT and work with our colleagues to develop disaster recovery plans. These plans are often based around having two systems. In the case of data corruption, this makes our disaster recovery strategy invalid, as both our sets of data, the main and the back-up, are corrupt. If we cannot fix the corruption, then our only option is to roll back the systems to the last uncorrupted back-up, and we may lose the data created between that back-up and now. We should discuss our IT department’s strategy for dealing with this type of incident.
2. In looking at disaster recovery, we often tend to concentrate on servers and data centres. We don’t look at routers and the gateway into our systems. The loss of the Microsoft Active Directory prevented staff from accessing some systems, even though the main systems were all working correctly. As part of our risk assessment we, or IT, should look again at how we access the systems. Are there single points of failure in the way we access and gain entry to our systems?
3. I heard an interview with a doctor, not from the Health Board but from an associated organisation. I was amazed at her naivety in terms of IT and how systems work. I got the impression that she thought magic pixies come out in the night and can fix any IT system. I don’t think she is alone in not understanding the risks and capabilities of IT systems and the impacts if they fail. I believe it is one of the roles of the business continuity manager to educate senior managers in the threats to IT systems. If they understand these threats, they can make appropriate investment in backup and disaster recovery and have a good understanding of the impact if the systems are lost.
4. In terms of communications with their interested parties, the board seem to have done well. I noticed there was a large box on the front of their website acknowledging the problem and routing users to a page which explained the issues and what they were doing about them. Robert Calderwood, NHSGGC Chief Executive, apologised to all those whose appointments were cancelled. They used their Twitter account @NHSGGC to keep people informed and to point them in the direction of the website, where there was a more detailed explanation of what was happening. The @NHSGGC Twitter account was used on a day-by-day basis before the incident, which built its credibility as the voice of the organisation prior to the incident.
5. Lastly, we mustn’t forget the impact of this incident on people. This was an IT issue, but the consequence was people missing important operations. If you are going for an operation, you may have waited months for it and have mentally prepared yourself and your family for it. Then to have the operation cancelled at the last minute, due to a computer error, must be very difficult to take. I am sure the organisation is doing its best to manage this, but we as business continuity people should never forget the impact of incidents on the general public.
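Returning to the first lesson, it can help to put a number on what a roll-back would cost before talking to IT. The following is only a rough sketch in Python, not a description of how NHSGGC or any particular back-up product works. The `Backup` class, the `verified_clean` flag and the dates are all hypothetical and made up for illustration. It simply picks the most recent back-up that passed an integrity check and reports how much data, measured in time, would be lost by rolling back to it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Backup:
    """A single back-up. Both fields are hypothetical, for illustration only."""
    taken_at: datetime
    verified_clean: bool  # assumed result of some integrity check on the back-up


def last_clean_backup(backups: Iterable[Backup]) -> Optional[Backup]:
    """Return the most recent back-up that is known to be uncorrupted, if any."""
    clean = [b for b in backups if b.verified_clean]
    return max(clean, key=lambda b: b.taken_at, default=None)


def data_loss_window(backups: Iterable[Backup], now: datetime) -> Optional[timedelta]:
    """Return how much recent data would be lost by rolling back to the last clean back-up."""
    backup = last_clean_backup(backups)
    if backup is None:
        return None  # no clean back-up exists, so roll-back is not an option
    return now - backup.taken_at


# Invented example: nightly back-ups at 02:00, with corruption creeping in
# after the first one, so the two most recent back-ups are also corrupt.
example_backups = [
    Backup(datetime(2024, 1, 1, 2, 0), verified_clean=True),
    Backup(datetime(2024, 1, 2, 2, 0), verified_clean=False),
    Backup(datetime(2024, 1, 3, 2, 0), verified_clean=False),
]
example_now = datetime(2024, 1, 3, 9, 0)

print(data_loss_window(example_backups, example_now))  # prints "2 days, 7:00:00"
```

In this invented case the most recent back-up is only seven hours old, yet the real loss is more than two days of data because the corruption was copied into the later back-ups. That gap between the back-up schedule and the last back-up you can actually trust is the figure worth raising with the IT department.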