These are the days of the big data buzz and the proliferation of mind-numbing reports to beef up any kind of argument imaginable. But I'm reminded of one fight between the IT operations and development teams that showed how simple metrics can decisively prove a point. It was before the advent of devops, a classic case of many different teams involved in day-to-day IT operations, and it raises the question, "Can't we all just get along?"
It was the mid-1980s, and I was working at a regional bank in the data center group as a fairly new first-level manager of the IT operations support team. All computer processing, mostly batch work, happened on either Honeywell or IBM mainframes.
There was one major annoyance for our team: dropping everything to deal with failures caused by the development team's simple errors. For example, basic typos were made in JCL (job control language) -- long before software was available to catch such problems.
Our team knew that some of the developers were less than careful about testing changes, partly because one of our team's jobs was to fix batch job failures ASAP without the developers' help. We were on-site 24/7, and they were usually at home when the batch cycles were running. We could either wait for them to arrive or do it ourselves. We went with the latter (faster) option.
Even more frustrating, the developers' managers weren't holding their own people responsible for the numerous job failures. Developers could toss in a half-baked change request and think to themselves, "Someone else will fix it if it goes south" -- which is exactly what we did.
Not only was operations taken away from our other projects by these missteps, but the failures made our whole business look bad. For example, if a botched batch job prevented updates to the Demand Deposit Accounts (DDA) system, we were at risk of not having current checking, savings, and CD account information available in the branches. If that happened too often, senior bank management would "have words" with the IT managers. It was also very embarrassing when our tellers had to tell customers, "Sorry, we can't give you your current balance right now."
Our boss talked to development's manager many times about their quality control, but to no avail. Finally my boss came up with a way to get the point across. There was a daily processing report, created on paper, that came from our data center group. Copies were distributed to most managers and to others who requested it. This report included such info as check volumes, start and completion times of key batch jobs, and a list of jobs that failed or "ABENDed" ("ABEND" standing for "abnormal end" of the job). My boss's goal was to show a connection between the number of changes submitted by development and the number of ABENDs.
Fortunately, it was easy to get the numbers: Back then, a change request had to be submitted on a three-part carbon-paper form and filed. One change request could be as trivial as a small tweak in batch JCL or as large as dozens of program updates, but every single one had to be documented.
My boss hit upon a very simple and effective metric that he had added to the bottom of this daily report -- with no advance warning. It reported, for the prior week, three numbers:
- Number of change requests submitted by development
- Number of batch job failures
- Change-to-failure ratio (batch job failures as a percentage of change requests)
The metric was crude, but very effective. When initially published, the change-to-failure ratio was over 25 percent. Without having to explain anything to anyone, this said one out of four changes submitted by the development teams failed.
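The arithmetic behind that line on the report is simple enough to sketch in a few lines of Python. This is only an illustration: the function name and the weekly counts are hypothetical, since the story gives the resulting percentages but not the underlying numbers.

```python
def change_to_failure_ratio(change_requests: int, job_failures: int) -> float:
    """Return batch job failures as a percentage of change requests."""
    if change_requests <= 0:
        raise ValueError("need at least one change request to compute a ratio")
    return 100.0 * job_failures / change_requests


# Hypothetical counts for one week; not figures from the actual report.
week_changes = 40
week_failures = 10

ratio = change_to_failure_ratio(week_changes, week_failures)
print(f"Change requests submitted: {week_changes}")
print(f"Batch job failures:        {week_failures}")
print(f"Change-to-failure ratio:   {ratio:.0f}%")
```

With these made-up counts, 10 failures against 40 changes works out to 25 percent, the "one out of four" reading. Note that the figure compares weekly totals only and does not tie any single failure to any single change.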
The development team managers were furious! They sputtered that operations could not say a specific change caused a specific batch job failure. My boss patiently explained he wasn't saying that -- he was just publishing hard numbers, which spoke for themselves. My boss's argument: "If nothing is changed, we have a lot fewer failures, so overall we know that changes cause failures."
Despite the complaints from development, my boss's boss (the data center manager) refused to drop that metric from the daily report. Development's management realized they needed to get serious about quality control.
To their credit, in a bit over three months the change-to-failure ratio dropped to about 10 percent -- exactly the type of improvement my boss had been hoping for. As a result, the people on our team had extra time to work on more fun projects than fixing ABENDs.
Quality control was important to development's managers up to a point, but there was always pressure to move on to the next project or change request from the business. What my boss did caused them to realize that their lack of more thorough quality control was creating problems that could be easily avoided. When we had fewer batch job ABENDs, we had fewer instances of the DDA system missing its service-level agreement for uptime.
This, in turn, made for much happier tellers, not to mention more satisfied customers and executives.
This story, "Ops to dev: It's your fault, and here's proof," was originally published at InfoWorld.com. Read more crazy-but-true stories in the anonymous Off the Record blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.