Didn't test? Then don't deploy

Test, test, and test some more -- or leap to production at your own peril

Few feats in IT are as rewarding as standing up a new infrastructure and molding it into a production system. Though we may be clicking more than racking these days, thanks to virtualization, it's still exciting to see the fruits of your labor ripen into a solid and reliable resource. In a perfect world, everything comes together perfectly, with every aspect planned and executed precisely as required, when required, and the end result is immediately ready for work.

In reality, we know it's not quite that easy.

I'll use a new virtualization cluster as an example here, but this applies to just about everything in IT, from the network layer to the application layer. The nuts and bolts of the construction are somewhat formulaic. You pull together your chosen server, switching, and storage hardware and start hooking up everything. Your original design is followed more or less to the letter, and as you progress through the build, you hope not to run into anything too surprising, such as an unexpected driver incompatibility or a buggy software stack. Even if all proceeds according to plan and everything looks like it's ready to go, you're far from done. You're really just starting.

Because now you have to hammer the bejesus out of everything until you're confident it's 100 percent ready for production. And you're never going to be 100 percent certain.

It's human nature to hurry toward the light at the end of the tunnel, to hasten your steps as you sense the end of a journey or task approaching. Where we might have been methodical and painstaking with our progress initially, we have an urge to gloss over many seemingly innocuous or minor details when we're nearing the end. And here there be tygers.

Back to our new virtualization cluster: We're replacing an older cluster with a large pile of bigger, better, faster, more, moving from 1G to 10G networking, from relatively slow storage to fast storage, maybe even jumping from Clovertown to Sandy Bridge. The new gear is going to make everything easier and faster, and there are few who don't look forward to its implementation. Ports are wired, switches configured, storage initialized, shares and LUNs created; the whole miasma of a modern virtualization build happens in relatively rapid succession. After all, if it's a good design, it's essentially a cookie-cutter build.

A few test VMs are built, and they're fast and responsive, blowing the doors off of their elderly counterparts. They appear to work perfectly, and quick testing shows everything as 5-by-5. This is precisely where the desire to leap ahead and throw the system into production takes hold -- and where cooler heads need to lock it down to spend days or weeks running comprehensive tests on every element before production workloads are introduced.

First, we need to thoroughly exercise the storage from every host in the cluster. Fortunately, that's extremely easy with virtualization. Build a quick Linux VM with scripts that run Bonnie++ or even dd in a loop, then clone the whole shebang as many times as necessary to put a significant load on each physical host in the cluster, hitting every planned LUN or share on the storage. With randomized sleep times, this produces a randomized workload of streaming reads and writes, or of random reads and writes, or whatever mix you like. If you really want to stress a storage subsystem, there are few better ways to do it.
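As a rough illustration only, here's a minimal sketch of the kind of dd loop described above. It assumes a Linux test VM, GNU dd, and a mount point for the LUN or share under test; the path, sizes, and sleep range are placeholders to tune for your environment, not anything prescribed by the article.

```python
#!/usr/bin/env python3
"""Crude storage churn loop: random-size dd writes and reads with random sleeps.

Assumptions (placeholders, not from the article): the VM mounts the LUN/share
under TARGET_DIR, and it's safe to create and delete scratch files there.
"""
import os
import random
import subprocess
import time

TARGET_DIR = "/mnt/testlun"   # placeholder mount point for the LUN/share under test
BLOCK_SIZE = "1M"
MAX_BLOCKS = 4096             # up to ~4 GB per pass; tune to the storage size
SCRATCH = os.path.join(TARGET_DIR, f"scratch-{os.getpid()}.bin")

while True:
    count = random.randint(256, MAX_BLOCKS)

    # Streaming write, bypassing the page cache so the array actually sees the I/O.
    subprocess.run(
        ["dd", "if=/dev/urandom", f"of={SCRATCH}",
         f"bs={BLOCK_SIZE}", f"count={count}", "oflag=direct"],
        check=True)

    # Streaming read of what was just written, again with direct I/O.
    subprocess.run(
        ["dd", f"if={SCRATCH}", "of=/dev/null",
         f"bs={BLOCK_SIZE}", "iflag=direct"],
        check=True)

    os.remove(SCRATCH)

    # Randomized pause so cloned VMs don't all hit the storage in lockstep.
    time.sleep(random.uniform(5, 120))
```

Clone that VM as many times as you need, point each clone at a different LUN or share, and spread the clones across every host in the cluster.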

Now, after watching that for a few days and noting the absence of network or storage errors, we should add to the load. Toss Netperf or a similar network stress tool on each of those test VMs, write a quick script that randomizes TCP traffic with different sizes, payloads, and test durations between all the VMs, and loop it the same way. Run that concurrently with the storage workload. If you want to add to the misery, throw in a few other VMs with a large number of virtual CPUs and plenty of RAM, then run CPU and RAM stress routines on them. At this point we should be hammering the hell out of just about every aspect of the cluster, from CPU to storage, from RAM to the network. If something's going to break, this is where it should happen, at least in theory.
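Here's a similarly hedged sketch of the randomized netperf loop. It assumes netperf is installed on each test VM and netserver is running on the peers; the addresses, test types, and size/duration ranges are placeholder assumptions to adjust for your own test VMs.

```python
#!/usr/bin/env python3
"""Randomized netperf loop between test VMs.

Assumptions (placeholders, not from the article): netperf is installed locally,
netserver is running on every address in PEERS, and those addresses belong to
your test VMs.
"""
import random
import subprocess
import time

PEERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]   # placeholder test-VM addresses
TESTS = ["TCP_STREAM", "TCP_RR"]                   # bulk throughput and request/response
MSG_SIZES = [1024, 8192, 65536]                    # bytes per send or request

while True:
    peer = random.choice(PEERS)
    test = random.choice(TESTS)
    duration = random.randint(10, 300)             # seconds per run
    size = random.choice(MSG_SIZES)

    if test == "TCP_RR":
        extra = ["-r", f"{size},{size}"]           # request,response sizes for RR tests
    else:
        extra = ["-m", str(size)]                  # send message size for stream tests

    # Global options: -H peer, -l duration, -t test type; test-specific options after "--".
    subprocess.run(
        ["netperf", "-H", peer, "-l", str(duration), "-t", test, "--"] + extra,
        check=True)

    # Random gap so traffic from all the VMs doesn't line up.
    time.sleep(random.uniform(1, 60))
```

Run one copy per test VM alongside the storage loop so the network, the storage, and the hosts all take the beating at the same time.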

Right about then is when I'd start trying to break things. Pull a host's power and make sure any failover actions happen appropriately. Run an automated host upgrade process and watch it carefully. Yank a network cable, or shut down the relevant switch port, and make sure that bonded and failover network links work like they're supposed to. Also check that all of this happens correctly under load -- that's when it matters most.
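For the link-failure case, a soft version can be scripted as a first pass. The sketch below is an assumption-laden stand-in for physically pulling the cable or shutting the switch port: it drops one member of a bonded pair on a test host with the standard ip command, waits, and brings it back. The interface name and outage window are placeholders.

```python
#!/usr/bin/env python3
"""Soft 'cable pull': drop one member of a bonded link, then bring it back.

Assumptions (placeholders, not from the article): run as root on a test host
whose uplinks are enslaved to a bond, with IFACE naming one bond member. This
only simulates a link failure in software; it doesn't replace pulling the
actual cable or disabling the switch port.
"""
import subprocess
import time

IFACE = "eth1"          # placeholder: one member of the host's bond/team
OUTAGE_SECONDS = 120    # long enough to confirm traffic keeps flowing on the other link

subprocess.run(["ip", "link", "set", IFACE, "down"], check=True)
print(f"{IFACE} down -- watch VM traffic, storage paths, and the bond status now")
time.sleep(OUTAGE_SECONDS)

subprocess.run(["ip", "link", "set", IFACE, "up"], check=True)
print(f"{IFACE} back up -- confirm the bond re-adds the link cleanly")
```

Run it while the storage and network loops are still pounding away, and keep an eye on the bond status and the VMs' I/O the whole time.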

For some, this is one of the best parts: coming up with ways to beat the stuffing out of fresh gear, poking for weaknesses and holes. For everyone, the benefits are indispensable. For one thing, it brings a certain peace of mind after the production workload shifts over; for another, it's vastly easier than trying to fix a big problem that was missed early on and winds up causing production outages.

So test, test, and test some more. Have some fun cooking up creative ways to stress every subsystem, every component, and ease everything into production after a reasonable breaking-in period. That light will still be on when you get there, perhaps a bit brighter and more soothing than before.

This story, "Didn't test? Then don't deploy," was originally published at InfoWorld.com. Read more of Paul Venezia's The Deep End blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.
