In my blog post about important features of a service management tool, I wrote “I am increasingly coming to believe that the change management process is perhaps the most important one for success in ITIL – not to mention DevOps – adoption” – and I stand by that assessment. Not because change management is such a revolutionary idea in and of itself, but because of what true management of change means for a business.
I am, of course, referring (however obliquely) to Eliyahu Goldratt, and the Theory of Constraints. In my role as Change Manager, I have observed IT staff nigh on break their backs to get enough done during maintenance windows – essentially attempting to achieve feats of heroism. The desire to avoid downtime outside of the maintenance window, though noble, results in what I believe to be a misguided approach to change and maintenance work.
In most industries, the idea of routinely working outside of office hours would – at the very least – be a last resort, and something which employees, managers, and unions alike would actively oppose. Yet, frustratingly, this is a commonly accepted practice in the field of information technology (and particularly in the Operations realm). Not only can we do better – I think we owe it to ourselves, and to the generations that will follow, to do so.
Imagine not having to install patches on production servers outside of business hours. Imagine performing maintenance on business-critical systems while they are in production, at worst taking the systems down for a fraction of a second. Imagine being able to upgrade critical systems while keeping data available to users. Imagine migrating servers from one site to another, in production (or better yet, imagine the actual location of each server farm being irrelevant, because the servers float across all of them).
The biggest problem here is databases. While you can run two instances of the same server (e.g. through blue/green deployment), databases are not so easy to duplicate. I maintain that partial availability is preferable to complete downtime, and therein lies one approach with regard to databases: make a copy of the database, make the master read-only, and upgrade the copy. Once the upgrade is complete and checks are done, put the upgraded copy into production.
By adding a few additional steps in the upgrade process, we ensure that the system remains available for information lookup, with the added benefit of a far simpler reversion, should the upgrade process fail. Though a fully redundant spare instance of all (or most) servers in staging is the ideal, even maintaining a cold mirror of a server makes fail-over simpler – not to mention the fact that it makes migration from one server room to another possible with a minimum of downtime.
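To make the sequence concrete, here is a minimal sketch of the copy, freeze, upgrade, and cut-over steps. It uses SQLite through Python's standard `sqlite3` module purely as a stand-in for a real database server. The function names, the `users` table, and the example upgrade are all hypothetical, and the sketch freezes writes just before taking the copy (an assumption on my part) so that the copy cannot miss a late write.

```python
import sqlite3


def upgrade_with_read_only_master(master, upgrade, check):
    """Upgrade a copy while the master stays readable.

    Returns the connection that should serve production afterwards:
    the upgraded copy on success, or the untouched master on failure.
    All names here are hypothetical; this is an illustration only.
    """
    # Freeze writes on the master. Lookups keep working.
    master.execute("PRAGMA query_only = ON")

    # Take a full copy of the master into a separate database.
    copy = sqlite3.connect(":memory:")
    master.backup(copy)

    try:
        upgrade(copy)
        copy.commit()
        if not check(copy):
            raise RuntimeError("post-upgrade checks failed")
    except Exception:
        # Reversion is simple: discard the copy and reopen the master.
        copy.close()
        master.execute("PRAGMA query_only = OFF")
        return master

    # Cut over: the upgraded copy becomes the production database.
    return copy


def add_email_column(conn):
    # Example "upgrade": a schema change applied to the copy only.
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")


def checks_pass(conn):
    # Example checks: the file is intact and the new column exists.
    intact = conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
    return intact and "email" in columns


# Usage: build a tiny master database, then run the upgrade against it.
master = sqlite3.connect(":memory:")
master.execute("CREATE TABLE users (name TEXT)")
master.execute("INSERT INTO users VALUES ('ada')")
master.commit()

production = upgrade_with_read_only_master(master, add_email_column, checks_pass)
```

Throughout the upgrade the master still answers queries, and if the upgrade or the checks fail, the function simply hands the master back with writes re-enabled. On a real database server the freeze, copy, and cut-over steps would each use that product's own mechanisms, which this sketch does not attempt to model.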
There are numerous other examples of the practices I describe above. Providing staff with the ability to perform most of these tasks during normal business hours offers a number of benefits:
- Reductions in overtime costs involved with the work
- Higher staff availability for swarming problems that may occur
- Increased availability of systems to the end user
- Reduction in the impact of severe systems outages
- Reduction in the risk of staff burn-out
It is my hope that DevOps will become the norm, not the outlier. It is my hope that working outside of ordinary business hours will become the exception, not the rule.