Chapter 8: Service Management and Disaster Planning – Fundamentals of Information Risk Management Auditing: An introduction for managers and auditors

CHAPTER 8: SERVICE MANAGEMENT AND DISASTER PLANNING

Introduction

Data and information are vital to the management and operation of modern organisations. Communicating, making market offerings and sales, receiving income, obtaining goods and services, paying staff, and reporting to stakeholders, regulators and shareholders all require access to data and computer processing capability. Just consider the frustration when your PC does not work correctly in the office. The availability of IT services is now vital to the normal operation of all organisations. In order to maintain 24 hours a day, 365 days a year (24 x 365) operations, organisations need to ensure that they have arrangements for:

•   Monitoring their service availability, and reporting and escalating any incidents that occur (service management).

•   Dealing with any critical loss of IT or other business resources as a result of a natural or man-made disaster, so that they can continue business operations.

In this chapter we will consider both the service management and business continuity aspects of availability of services and how they should be audited.

Service management overview

The area of service management has seen significant changes during my career. As part of my training in 1981 I spent three months on secondment to the IT department of a large local authority. Most of the computing was on a central IBM mainframe. The finance department had the only PC, a Commodore PET, which was kept in a locked room; access was granted only when you signed a log for the key. The mainframe itself was kept in a secure, highly controlled environment. All operations were centralised in IT, so if the service was not available the IT department knew about it and would fix it accordingly. IT processes were scheduled at specific times (e.g. the payroll run on Thursday mornings). Access to the mainframe was 8 am to 6 pm, Monday to Friday. Data input was performed by a central team, based upon (paper) forms completed by the end user.

Nowadays:

•   Much of the data input, and some routine management, such as performing batch runs or running reports, is now performed by end users using automated processes. This includes the performance of some basic control processes (e.g. running additional backups).

•   PCs and other mobile devices are everywhere in organisations. They are vital to accessing information and resources people need to do their jobs.

•   Access is required around the clock, every day, in all sorts of locations, not only in organisations and offices but also in coffee bars, on transport, in hotels, etc.

•   The loss of even an hour of access to IT resources can have a significant impact on users' efficiency and effectiveness.

•   Mainframes have effectively disappeared – to be replaced by servers and even ‘the Cloud’.

The basic principles of service management remain the same. Organisations need:

•   Mechanisms to deliver IT services and resources including:

   Agreed levels of deliverable service management (e.g. the ability to provide new users with laptop devices and system access within, say, 24 hours)

   Ability to identify when resources are reaching capacity (e.g. email mailboxes or file storage are reaching limits)

   Availability management

   Service continuity management

   Ability to operate within the budget and resources available.

•   Arrangements to support the provision of IT services, including:

   Reporting, logging, investigating and resolving incidents

   Managing problems to ensure they are resolved effectively and in the right timeframe and that root causes are identified, to prevent recurrence of the same issue in the future

   Providing service desk support to minimise any disruption caused by faulty equipment or software

   Managing the release of new operating and application systems into the production environment

   Managing the configuration of the network and other IT assets and software

   Management of changes made to the network and operating environment.
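The capacity point in the list above (identifying when mailboxes or file storage are approaching their limits) can be illustrated with a minimal sketch. The resource names, the figures and the 80% warning threshold are assumptions made purely for illustration; in practice the limits would come from the organisation's own capacity plan.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A monitored resource; the names and figures used below are hypothetical."""
    name: str
    used_gb: float
    limit_gb: float


def nearing_capacity(resources, warn_ratio=0.8):
    """Return (name, ratio) pairs for resources at or above warn_ratio of their limit."""
    flagged = []
    for r in resources:
        ratio = r.used_gb / r.limit_gb
        if ratio >= warn_ratio:
            flagged.append((r.name, round(ratio, 2)))
    return flagged


resources = [
    Resource("mailbox: finance team", used_gb=45.0, limit_gb=50.0),
    Resource("file share: HR", used_gb=120.0, limit_gb=500.0),
]
for name, ratio in nearing_capacity(resources):
    print(f"{name} is at {ratio:.0%} of its limit")
```

An auditor would not normally write such a check, but would expect to see evidence that something equivalent runs regularly and that its warnings are acted upon.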

Risks

The main risk associated with service management is the loss of availability of the IT resources required for business operations, leading to lost income and increased costs. In some industries there may also be regulatory or public image consequences of such losses.

Table 9: Service management risks

| Risk | COBIT 5 Process ID | COBIT 5 Process |
|---|---|---|
| IT services may not be delivered as required/planned. This could lead to inappropriate use of limited IT resources. | DSS01 | Manage Operations |
| Failure to promptly and fully resolve requests for resources or to resolve incidents, including security events, could result in loss of user productivity causing an impact on business operations. | DSS02 | Manage Service Requests and Incidents |
| Failure to prevent or resolve issues or problems | DSS03 | Manage Problems |
| Failure to prevent or resolve problems could disrupt and restrict access to resources and data causing customer and user inconvenience and dissatisfaction. | DSS04 | Manage Continuity |

Controls

Organisations need to have policies, service level agreements, procedures and guidelines for their service management processes. Many of the organisations I have been involved with use the ITIL framework as a basis for their own framework. ITIL (Information Technology Infrastructure Library) is currently provided by AXELOS – a collaboration between UK Government and a large IT service provider. The ITIL best practices are currently detailed within five core publications:

1.  ITIL Service Strategy

2.  ITIL Service Design

3.  ITIL Service Transition

4.  ITIL Service Operation

5.  ITIL Continual Service Improvement.

The ITIL scheme also offers certification, training, tools and opportunities to exchange ideas. It covers all of the issues and risks identified above.

Synthetic monitoring/testing

Websites, both internal and external, are used for many mission-critical aspects of business, including obtaining and recording sales, and providing access to online services. Organisations need to know that these services are operating effectively all of the time and to be immediately aware if a service is not accessible. There are a number of tools and services available to provide active monitoring (also known as synthetic monitoring) of web addresses (URLs) to ensure this, and often to run standard transactions on screens within the site. Pre-prepared scripts are used to ping each web address at specified regular intervals, confirm that it can be accessed and monitor response times. This enables the webmaster or manager to monitor performance and respond to any downtime, whether alerted via a dashboard screen, reports or SMS messages.
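The kind of script described above can be sketched in a few lines. This is a minimal illustration, not a description of any particular monitoring product: the function names, the two-second response threshold and the use of `print` as the alert channel are all assumptions, and a real service would typically also replay scripted transactions rather than just request a page.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Request the URL and return its HTTP status code (raises on network failure)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status


def probe(url, fetch=fetch_status, max_seconds=2.0):
    """Run one synthetic check and return a result dictionary."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except OSError:  # URLError and timeouts are OSError subclasses
        status = None
    elapsed = time.monotonic() - start
    ok = status is not None and 200 <= status < 400 and elapsed <= max_seconds
    return {"url": url, "status": status, "seconds": elapsed, "ok": ok}


def monitor(urls, checks, interval_seconds, alert=print, fetch=fetch_status):
    """Probe each URL `checks` times, pausing between rounds, and alert on failures."""
    for round_number in range(checks):
        for url in urls:
            result = probe(url, fetch=fetch)
            if not result["ok"]:
                alert(f"ALERT: {url} failed check (status={result['status']})")
        if round_number < checks - 1:
            time.sleep(interval_seconds)
```

A call such as `monitor(["https://www.example.com/"], checks=3, interval_seconds=60)` (a placeholder address) would probe the site three times, one minute apart. In practice the `alert` argument would be replaced by whatever updates the dashboard or sends the SMS message.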

Documenting, assessing and testing availability controls

To review service management, I would consider the following:

Table 10: Service management checklists

| Area | Questions and tests |
|---|---|
| Review of policies, procedures and guidelines | Are they comprehensive? Up to date? Applied in practice? How are they updated and disseminated? |
| Service management process | Review documents of process and also obtain evidence as to whether this is applied in practice. Ascertain whether there are any exceptions. |
| Service levels | Review service management monitoring reports and information. In particular, look for evidence of review by management and action being taken to resolve any issues. Include any synthetic, capacity management or similar automated testing. |
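When reviewing service level reports, it can help to recompute the headline availability figure from the underlying incident log rather than relying on the reported percentage. The sketch below is illustrative only: it assumes a 30-day reporting period, a hypothetical 99.5% target, and outage durations (in minutes) taken from incident records.

```python
def availability_percent(outage_minutes, period_days=30):
    """Percentage of the period in which the service was available."""
    total_minutes = period_days * 24 * 60
    downtime = sum(outage_minutes)
    return 100.0 * (total_minutes - downtime) / total_minutes


def meets_target(outage_minutes, target_percent, period_days=30):
    """True if the recomputed availability is at or above the agreed target."""
    return availability_percent(outage_minutes, period_days) >= target_percent


outages = [42, 15, 180]  # hypothetical incident durations in minutes
print(f"{availability_percent(outages):.2f}%")
print(meets_target(outages, target_percent=99.5))
```

With these figures, 237 minutes of downtime in a 43,200-minute month gives roughly 99.45% availability, which falls just short of the assumed target. A difference between a recomputed figure and the reported one is a useful prompt for further questions.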

Disaster planning

Introduction

A good many years ago I made a recommendation in an audit report that a medium-sized local authority, highly dependent on its IT, should have a business continuity plan. They asked me how much they should spend on it, and I answered that, as with all insurance arrangements, it was their call. I said that if they had a disaster they would wish they had spent more, and that if they didn't have a disaster they would wish they had spent less. They introduced a plan. A few months later they had a fire in the computer room (nothing to do with me, honest!) and had to implement it. At that stage they wished they had invested more resources in the plan: it was effective, but it had some gaps which proved expensive to resolve while also dealing with the incident.

Following the Y2K scare and 9/11, most organisations now take business continuity and disaster recovery planning more seriously. They know that they need the resilience to deal with a range of occurrences, from the loss of a single PC or website right up to the loss of a whole location or network. There are numerous examples that show the benefit of preparation, planning, testing and training in resolving these issues and being able to continue to do business. They also know that it is not just about IT: you can have the best arrangements in the world, with state-of-the-art recovery centres and the latest equipment, but if you do not have the communications infrastructure, people and processes to go with them, you will still be unable to continue to do business with your customers. Increasingly, in the information age, it is also necessary to deal effectively with the media and be able to communicate clearly with the public, and sometimes with the family or friends of any casualties caused by the incident.

Although organisations are much more prepared now than they used to be, the problem has become more complex:

•   The level of connectivity that users require is much higher. For example, the loss of Internet, email and mobile apps can critically impact business effectiveness very quickly.

•   The number of different applications and services that are used to achieve just one thing (such as a mobile CRM app) is high, so there are many more points of failure (rather than just a fire in the data centre).

•   More IT is now developed and/or contracted outside the IT organisation and so may not be included in the plans, sometimes with a number of different providers that may need to be involved if there is an incident.

Effective business continuity ensures that organisations are ready. To quote ISO 22301:2012: “Business Continuity is the capability of the organisation to continue delivery of products or services at acceptable predefined levels following a disruptive incident”.

ISO 22301:2012 specifies requirements to plan, establish, implement, operate, monitor, review, maintain and continually improve a documented management system to protect against, reduce the likelihood of occurrence, prepare for, respond to, and recover from disruptive incidents when they arise. ISO 22301:2012 is applicable to all organisations; however, how it is applied can be flexed, depending on your operating environment and complexity.

Risks

For a number of years there have been two statistics quoted:

•   80% of businesses suffering a major disaster go out of business within three years.

•   40% of businesses that suffer critical IT failure go out of business within a year.

Like a number of other researchers, I have been unable to trace the source of these quotes – they may just be myths. However, the fact remains that most businesses will do what they can to prevent a disaster occurring. If it does occur, they will recover faster if they have a plan that is effective and tested. Even if the business continues, the incident will impact normal business operations. So the main risks could be summarised as:

Business failure or major disruption as a result of:

•   Not having an effective plan, supported by appropriate arrangements.

•   The plan being out of date, for example, not taking account of new business operations or forms of IT provision.

•   The plan not having been thoroughly tested to ensure that it is effective and understood by all staff impacted.

Controls

The only real control is to have a clear plan that addresses the incidents you are likely to face, has been designed in accordance with your business needs and requirements, and is effectively implemented and tested. The Business Continuity Institute (BCI) refers to this as embedding business continuity (see www.thebci.org/).

Case study examples

My favourite case study is one in which I had no involvement, other than as a Hampshire ratepayer, when the County Council was successfully sued over decisions and actions taken by the local fire service!

In March 1990 the fairly new headquarters of Digital Computing in Basingstoke suffered a fire which destroyed the building and millions of pounds worth of computer equipment. As would be expected for a major computing company, there was a good disaster recovery plan including backed up data, alternate sites, etc. and business operations were quickly resumed. However, even in this case there were lessons to be learnt. Most disaster recovery plans assumed at that time that incidents occurred whilst the building was empty – for example in the middle of the night. In this case it was during normal office hours – with over 300 staff inside the building. I believe that no-one was seriously injured.

I remember seeing a video of the incident developing (clips are still available online) and hearing a talk from the disaster recovery manager after the event. The film starts with people in the car park – looking at the office – just a normal fire alarm drill. Then someone spots the fire in the top corner of the building – it quickly spreads and people are moved to a safer distance. Then they realise their jackets and handbags, car and house keys are inside. Worse still, the falling building and fire destroyed dozens of their cars. The plan was extensive and well tested – as a result the disaster recovery team were able to adapt to the unexpected elements of the incident – mainly to assist staff (locksmiths were hired to help them get into their homes, arrangements for car hire, etc.). This was only possible because of good preparation.

Documenting, assessing and testing availability controls

An audit or assurance review should include the risks shown in Table 11.

Table 11: Availability risks

| Risk | COBIT 5 Practice ID | COBIT 5 Practice Name |
|---|---|---|
| The plan may not be comprehensive and may miss: outsourced or off-shored services; legal obligations; key stakeholders; critical systems, processes or staff; minimum service levels to be achieved. | DSS04.01 | Define the business continuity policy, objectives and scope. |
| The strategy behind the plan may not consider: all potential scenarios; the full potential impact of a disruption; expected recovery times; different recovery options; the RACI to be followed. | DSS04.02 | Maintain a continuity strategy. |
| The proposed response may not be fully defined, including: skills/roles and responsibilities; critical business processes/procedures to be followed; contact details for suppliers and partners; resumption arrangements; backup requirements; distribution of plans and supporting documents. | DSS04.03 | Develop and implement a business continuity response. |
| The validation plan needs to include: objectives for exercising and testing the business; realistic stakeholder exercises; roles and responsibilities; schedules; post-exercise debrief; arrangements for updating the plan after validation. | DSS04.04 | Exercise, test and review the BCP. |
| Regular review of plan; revisions to BIA; communication and approval of proposed changes. | DSS04.05 | Review, maintain and improve the continuity plan. |
| Training and awareness. | DSS04.06 | Conduct continuity plan training. |
| Management of backup arrangements (including third parties): frequency (monthly, weekly, daily, etc.); mode of backup (e.g. disk mirroring for real-time backups vs. DVD-ROM for long-term retention); type of backup (e.g. full vs. incremental); type of media; automated online backups; data types (e.g. voice, optical); creation of logs; critical end-user computing data (e.g. spreadsheets); physical and logical location of data sources; security and access rights; encryption; on- and off-site storage. | DSS04.07 | Manage backup arrangements. |
| Assess adherence and effectiveness and approval of changes/change management. | DSS04.08 | Conduct post-resumption review. |
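One simple test of the backup arrangements listed under DSS04.07 is to compare the age of the most recent successful backup with the stated frequency. The sketch below is a minimal illustration: the system names and timestamps are hypothetical, and 'daily', 'weekly' and 'monthly' are assumed to mean maximum ages of 1, 7 and 31 days respectively.

```python
from datetime import datetime, timedelta

# Maximum acceptable age for each stated frequency (an assumption for illustration).
MAX_AGE = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=31),
}


def overdue_backups(last_backups, now):
    """Return the names of systems whose last backup is older than its frequency allows.

    last_backups maps a system name to (frequency, time of last successful backup).
    """
    overdue = []
    for system, (frequency, last_run) in last_backups.items():
        if now - last_run > MAX_AGE[frequency]:
            overdue.append(system)
    return sorted(overdue)


now = datetime(2024, 3, 15, 9, 0)
last_backups = {
    "payroll database": ("daily", datetime(2024, 3, 14, 23, 0)),
    "file server": ("weekly", datetime(2024, 3, 1, 2, 0)),
    "finance spreadsheets": ("monthly", datetime(2024, 2, 20, 2, 0)),
}
print(overdue_backups(last_backups, now))
```

Here only the file server is reported, because its last backup is more than 14 days old against a weekly schedule. The same comparison can be made by hand from backup logs; the point is that the evidence should come from the logs themselves and not just from the documented schedule.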

Summary

Ensuring the availability of IT resources and data requires good service management to reduce the risk. However, contingency arrangements are still required in case these measures prove inadequate.