Even a *HOT* backup is not a Business Continuity Strategy

Over in the PACSGroup UK forum there is a discussion thread that has been quite busy over the last week or so.  The overall topic is the specification of the notion of ‘Vendor Neutral Archive’ – in particular, at least as I read it, what problems VNAs are supposed to solve.  Lest I be misinterpreted at any point – I am a big supporter of any disruption in the Radiology/PACS market that empowers the end user to a greater extent than they have been in the past.  That disruption has come in the shape of VNAs over the last couple of years and I tip my cap.

However, there is one suggestion regarding the value of VNAs that gives me pause for thought.  That is that a VNA can be used as a Business Continuity facility to mitigate the risk of PACS becoming unavailable.  That in its own right is a limited view of how PACS BC can be approached.  A VNA has the capability of maintaining an independent copy of images – which can be used as the distribution hub for clinical locations outside Radiology.  Let's assume for the moment that the viewing capabilities outside of Radiology are themselves independent of PACS (as with the current state of affairs, as well as many procurements under way even as I speak).  The fact remains (elephant in the room?) that if PACS is down – RADIOLOGY is down. That is NOT providing any kind of continuity of service at all.

And, in my mind – it is a profoundly oversimplified perspective of what BC is and the challenges that face it, as a process.

The foundation of business continuity are the standards, program development, and supporting policies; guidelines, and procedures needed to ensure a firm to continue without stoppage, irrespective of the adverse circumstances or events. All system design, implementation, support, and maintenance must be based on this foundation in order to have any hope of achieving business continuity

Wikipedia: Business Continuity
Emphases mine.

It is important to realise that Business Continuity measures cannot be easily plugged into existing systems, and in a moment I’ll give a couple of examples where BC processes can be hampered by decisions made in the delivery of basic infrastructure – quite possibly many years prior to the procurement of a given system.

Business continuity is sometimes confused with disaster recovery, but they are separate entities. Disaster recovery is a small subset of business continuity.

DR is where the old-fashioned off-site backups come in.  Let's say you suffer the nightmare scenario – you have a bunch of servers running in a server room which is consumed by fire.  DR is then about recovering to a ‘normal’ state of affairs.  You may need to provision a new server room.  You’ll probably need to order up a bunch of hardware in a hurry (some thoughts on Vendor Relationship Management to follow!) and mobilise your software vendors to help restore from whatever backup strategy you have in place – which – yes – can include a VNA (as long as it was on a separate site).  Depending on the number of systems involved and the amount of preparation, that process is likely to be measured on a timescale of days.  Clearly, DR in this extreme sense is NOT about continuity of business process.

Where BC comes in, in this case, is in protecting the viability and integrity of core business processes while the DR process is ongoing.  And BC must consider a whole gamut of potential risks other than the ‘nightmare scenario’.  Some of these risks can be identified, analysed, and mitigated.  Some that are identified may simply be carried, and there will always be the scenario that comes out of left field.  One of the core messages in the superb book “The Black Swan” by Nassim Nicholas Taleb is that the risks we identify should never do us significant damage – precisely because we saw them coming and can prepare in advance.  It is the disasters and events we don’t see coming (like the existence of black swans) that will have the biggest effect – because we didn’t.  It is the job of the Business Continuity Process to provide the flexibility and structures to deal with the black – as well as the white – swans.

A couple of examples that are reasonable extrapolations of real events that have occurred in any number of sites around the world:

Example 1:

Let's say one has a server room with a number of servers – most of which have disk storage served from a SAN.  The hardware for the servers, the SAN and the network has been carefully specified and implemented to eliminate any Single Points of Failure (SPoF) with multiple layers of redundancy, integrity checking and clustering.  Fire suppression is state-of-the-art.  What can go wrong?  The air conditioning fails, and will take 3 hours to fix.

With the heat generated by the various kit, the room temperature will go out-of-spec in 1 hour.  All servers must then be shut down until the AC is fixed and the environment restored.  It is possible to identify a subset of servers that simply MUST be kept online.  Can the remainder be shut down for the duration while the handful stay up – yes they can – but it doesn’t make any difference.  By far the biggest generator of heat is the SAN.  The SAN is either up – or it isn’t.  If it isn’t – all servers are down anyway.

Lesson: It's really easy to miss a risk at an early point in an overall infrastructure deployment that turns out to have a much bigger impact when facilities have grown.
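One way to surface this kind of shared dependency before the air conditioning fails is to write the inventory down and ask which components every must-stay-up server relies on. The snippet below is only a minimal sketch of that idea in Python. The server and component names are hypothetical, invented for illustration, and a real site would draw them from its own asset register.

```python
# Hypothetical inventory: each server lists the shared components it needs.
# All names here are illustrative assumptions, not taken from any real site.
DEPENDS_ON = {
    "pacs-db": {"san", "core-switch-a"},
    "ris-app": {"san", "core-switch-b"},
    "lab-orders": {"san", "core-switch-a"},
    "intranet": {"san", "core-switch-b"},
}
MUST_STAY_UP = {"pacs-db", "ris-app", "lab-orders"}


def shared_by_all(servers, depends_on):
    """Return the components that every one of the given servers relies on."""
    needed = [depends_on[name] for name in servers]
    return set.intersection(*needed) if needed else set()


def survivors(servers, depends_on, failed):
    """Return the servers that do not rely on any failed component."""
    return {name for name in servers if not (depends_on[name] & failed)}


# The SAN is common to every critical server, so it is a single point of failure.
print(shared_by_all(MUST_STAY_UP, DEPENDS_ON))  # {'san'}

# With the SAN shut down, no critical server survives, however many others are switched off.
print(survivors(MUST_STAY_UP, DEPENDS_ON, failed={"san"}))  # set()
```

Under these assumed names, the redundant network switches do not help: the SAN sits underneath everything, which is exactly the trap in the example above.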

How does BC handle this?  Probably by fallback to a manual process.  Does that mean special stationery needs to be held in stock? Does everybody involved know how to fall back to manual?  Will you need a team of facilitators mobilised to ease that process?  It WILL have an impact on the performance of involved departments – does there need to be an escalation plan to limit service demand by (for example) diverting emergency patients to alternative facilities? This is Business Continuity Planning, and requires a holistic approach.

Example 2:

Let's say you have all of that in place, but that the ‘event’ was a fire outside the server room.  Let's face it, fire suppression works, but fires still happen. The suppression kicked in and, as soon as the Fire Safety Officer has given the all-clear, servers are restarted within 30 minutes and the server room is back to normal in good time.  In the meantime the arrangements are that test orders are written on card (a supply of pink and blue cards is maintained in central stores) and urgent orders telephoned through to the relevant department. Additional staff are drafted in to deal with the extra legwork and data entry.  All fine.  Except the fibre optic cables from the server room were housed in metal conduits.  The conduit got hot, and melted the cables, so although the servers are up – they are still offline.  So manual fallback is still running.  Except for Orthopaedic Theatre – whose telephones run over VoIP – controlled – or not – by a server in the (currently) offline server room.  Oops.
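The VoIP surprise is an indirect dependency: the manual fallback relies on telephones, which rely on a server, which relies on the very network that has just failed. As a rough illustration only, the Python sketch below walks a hypothetical dependency map to check whether a fallback procedure is truly independent of whatever has failed. Every name in the map is an assumption made up for this example.

```python
# Hypothetical dependency map: each item lists what it directly relies on.
# The names are illustrative assumptions loosely based on the scenario above.
DIRECT = {
    "theatre-manual-fallback": {"order-cards", "theatre-telephones"},
    "order-cards": {"central-stores"},
    "theatre-telephones": {"voip-server"},
    "voip-server": {"server-room-network"},
    "central-stores": set(),
    "server-room-network": set(),
}


def all_dependencies(item, direct):
    """Collect everything `item` relies on, directly or indirectly."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for dep in direct.get(current, set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def fallback_is_independent(fallback, failed, direct):
    """Return True if the fallback relies on nothing in the failed set."""
    return not (all_dependencies(fallback, direct) & failed)


# The fallback looks manual, but it quietly depends on the offline network.
print(fallback_is_independent(
    "theatre-manual-fallback", {"server-room-network"}, DIRECT))  # False
```

A check like this only finds the dependencies somebody has written down, so it complements, rather than replaces, people on the ground who know how things are actually wired.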

Even the best laid plans sometimes need to pivot and adapt to retain BC.  And therein lies the key to true Business Continuity Planning – people.

If people on the ground (in clinical, admin and operational areas as well as technical) have the right skills and knowledge – technical, communications, and management, then whatever threat comes out of left field is vulnerable.  The more static and rigid the available skills – the more likely the Black Swan will bite.


Posted February 7, 2012 by eckythump in Uncategorized
