Metrics and key performance indicators (KPIs) for data centers used to be inward looking. In the not so distant past, KPIs tended to be “about the box,” measuring factors like server utilization.

Information technology (IT) managers gradually realized there was more to data centers than utilization, or even uptime. In particular, quality of service to business users grew in importance. As a result, we began to see metrics in areas like query and transaction response times, or, for the IT help desk, job ticket turnaround.

In recent years, the energy costs of data centers came into focus. In a 2007 report to Congress, the Environmental Protection Agency estimated that U.S. data centers consumed about 61 billion kilowatt-hours (kWh) in 2006, about as much energy as used by 5.8 million U.S. households. It also predicted that by 2011, energy use in U.S. data centers would exceed 100 billion kWh, or $7.4 billion in annual cost.

As highlighted in White Paper 154, “Electrical Efficiency Measurement for Data Centers,” a 1 MW high-availability data center can consume $20 million in electricity over its lifespan. What can be done to minimize such enormous spending? Well, we know it’s not just IT hardware consuming electricity. As the paper points out, the power and cooling infrastructure in a typical installation can consume half the electricity.

This realization has elevated the importance of the Power Usage Effectiveness (PUE) metric as a way to measure the efficiency of a data center’s physical infrastructure. PUE has become one of the most vital data center metrics. If managers measure PUE effectively, they have a great tool for ensuring energy is spent where it matters most, but there is also a danger of fixating on a single metric.
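As a quick sketch (the meter readings below are hypothetical), PUE is simply the ratio of total facility power draw to the power that actually reaches the IT equipment:

```python
# Sketch: computing PUE from two power readings.
# PUE = total facility power / IT equipment power; a PUE of 1.0 would
# mean every watt drawn by the facility reaches the IT load.

def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness for one measurement interval."""
    if it_equipment_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_equipment_kw

# Hypothetical readings: 800 kW at the utility feed, 500 kW at the racks.
# The other 300 kW goes to cooling, power distribution losses, and lighting.
print(pue(800, 500))  # 1.6
```

Note that where each reading is taken (at the switchgear, the UPS, or the racks) changes the result, which is exactly the measurement-consistency problem raised below.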

The best approach is more holistic, taking a step back to consider bigger objectives, including:

  • Having a dashboard framework that is able to integrate metrics from all types of systems: data center infrastructure management (DCIM), systems management software to track application performance, IT help desk metrics, and even metrics from enterprise resources planning systems that might contain financial goals for IT operations.
  • The ability to drill down from higher-level metrics such as PUE to lower-level metrics on factors like temperature or humidity. A DCIM framework can provide links to lower-level visualization to support a closed-loop approach to moving metrics in the right direction.
  • Consistency in measurement, especially with distributed data centers. For instance, is one data center’s PUE calculation incorporating power consumption estimates for switchgear while another’s isn’t? Does one data center measure power at the racks, while the others measure at the uninterruptible power supply?

In other areas of business, the need to think more broadly about metrics and link them with strategy is seen in methods such as The Balanced Scorecard, which combines financial and non-financial measures to better gauge corporate performance. Similarly, data center managers need the right mix of metrics.

As Kevin Brown notes in a recent blog post, we risk becoming “metric zombies” if we assess a metric like PUE in a vacuum, ignoring projects like data center consolidations that might temporarily hurt PUE but are necessary to keep budgetary mandates on track. In short, think first about the whole range of metrics and drill-downs you want in a dashboard—not just one number.

Every threat of outsourcing implies that IT's function has been commoditized, and that meeting business needs is simply a matter of negotiating price.
Unfortunately, while IT may be able to make some service level assurances and stick to a budget, they have little to demonstrate how efficiently they are really operating or how much competitive agility they really can bring to the company.
Some organizations have tried to expose "costs" by creating chargeback reports to bill IT spending back to business services. While this does create some internal economy-driven benefits by setting a price for each service, the practical goal of these accounting projects is to fairly recoup the full IT expenditure regardless of how big it was or how it was spent.
Often the resulting price per service causes unintended consequences for the business - users throttling their IT service usage or meeting service needs in unaccountable ways. Unfortunately, cost allocation schemes can create "pushback" on business demand for services while failing to expose how well IT is managing the total business investment in IT.
Chargeback alone won't work to get IT and the business working together - a whole set of IT-business processes like those espoused in Information Technology Infrastructure Library (ITIL) might be required.
ITIL training is fast becoming a job requirement for CIOs. The popularity of ITIL mandates among CXOs highlights how desperate the business is to make some sense of IT's independent nature.
But even in organizations with formal visible processes for IT financial management and capacity planning, IT requests for new investments or acquiring additional assets are still often decided by political "popularity" contests rather than by mathematically defensible justifications. Even worse, there is little to no accountability on the return of past IT investments to help judge current performance or direct future spending. Clearly, process alone can't align IT with the business.
Often, the only numerical reporting IT can produce shows how "available" discrete resources were, or how heavily specific devices were utilized over time.
This level of information doesn't tie to the value that the business gets out of IT infrastructure once you get past accounting for the negative penalties of downtime.
Today's major IT virtualization efforts are designed to create shared pools of resources, greatly increasing "efficiency" from an investment perspective. Virtualization also provides a shared reserve of available capacity and a dynamic environment to flexibly handle changes on demand.
But the inherent abstraction that today's server and storage virtualization technologies present is a real challenge for IT management. How does IT visualize a service's allocated infrastructure across multiple virtualized IT domains?
How can IT know, and prove to the business, that resources buried several layers down in the IT stack are contributing effectively to service performance, are being efficiently and optimally used for the business, and can be managed by IT in an agile way?
As examples, consider the recent trends to virtualize distributed system servers (VMware, Virtual Iron, Microsoft Virtual Server, Citrix XEN) and to further virtualize data center storage (IBM SVC, Apple, HDS UVM, NetApp V-Series, Incipient). Virtualization is inevitable as resources come to be seen as commodities, but these kinds of major virtualization create real management challenges.
Sadly, the current state of affairs is that the business only sees IT as a "black box" - they don't know if IT is doing things well internally, is optimally applying the investments already made, or can be relied upon to adapt and change.
Existing system management solutions can address the business view of workloads and can measure end-user response times. This is great for reaching maturity on service level management. And IT has lots of reports that show the utilization of the CPUs on servers, the disk space used, and how much network bandwidth is left. If this is the case, couldn't the business just hold IT accountable for good response time AND high device utilization? Isn't that what it means to be effective and efficient?
Technically the answer is yes, but this has been practical only when devices are dedicated one-to-one with services, so that all the resources for a service can be treated as a single dedicated "system" box.
When a business transaction enters IT, it traverses a "supply chain" of IT domain-specific service providers. Like fractal geometry, there is a self-similarity of how domains interact within IT to how IT as a whole interacts with the business. IT management needs to manage IT across domains, getting internal service metrics on each domain just as Business-IT management needs the service metrics for the whole IT system.
New cross-domain solutions are emerging that help IT generate service metrics across virtualized domains. These new solutions collect data from within each IT domain to "unpeel" the virtualization abstraction and gain visibility on the actual resources assigned to each service. They do this by modeling the queuing behavior across both physical and virtualized domains to provide the necessary management insight into how IT is really operating.
There is a way to measure internal IT performance that can make sense to the business. Let's start by examining the three basic performance metrics that describe a system where the system is treated as a single "box". Since we are concerned with IT's performance in service delivery, our basic metrics are:
APPLICATION "WORKLOAD" - The load or "demand" made on the system by users. In a stable system this is also equal to the throughput, and is usually described by the business in terms of business transactions. IT will need to put some effort into translating a business transaction into units of work that are executed within the IT infrastructure, but this is often addressed through established capacity planning and chargeback methodologies.

RESPONSE TIME - The main measurement of performance. This is the time each transaction takes to complete. End-user transactions can be externally "clocked" in many ways, and this is often accomplished through implementation of service level management solutions.

UTILIZATION - The effective "busy-ness" of the IT system that services the workload. Once utilization reaches 100 percent, no more work can be done.
Interestingly, these three metrics are related by queuing theory, which in a nutshell states that the more work going through a system, the busier it gets linearly, but the response time gets worse non-linearly. In other words, if we just cared about maximizing throughput, we could drive enough work to make the system 100 percent busy, but if we care about performance service levels, we have to do some queuing math to understand how much work the system can do before it slows down.
As the IT "system" is decomposed into physical and virtual management domains, the performance metrics described above can now be generated at each layer:
TRANSACTION WORKLOAD - The amount of actual work required from each domain to service a customer's request.

INTERNAL RESPONSE TIME - The response time for a transaction to complete its work across a specified set of domains (see Figure 1). For example, if you manage IT infrastructure that includes Server and Storage domains, you might create an "Infrastructure Response Time" that will serve as your primary service metric.

EFFECTIVE UTILIZATION - The effective utilization of the physical and virtual resources assigned to a service. For example, a virtual server with a specified "limit" would be at 100 percent utilization at that limit.
Figure 1 (Application Infrastructure Performance Overview): Internal Response Time is the response time for a transaction to complete its work across a specified set of domains.
More significantly, there are some key derived scores and indices that can be built from the new cross-domain performance models to help manage IT efficiency and agility, both at the domain level and at the overall data center level. A well-built, cross-domain performance model first produces an optimal operating goal for a resource that indicates the maximum effective utilization while still ensuring good performance. For each IT service, IT management can then report on the following system performance indicators:
PERFORMANCE INDEX - A score for how well a particular workload's set of assigned resources is being utilized compared to the optimal level. This index immediately shows whether resources have remaining capacity, are being over-utilized, or are aligned "just right" to meet the demand. The percentage of time this index remains in a favorable range can be used as an indicator of system performance "reliability".

SYSTEM EFFICIENCY - A key performance indicator (KPI) that tracks alignment of IT resources to workload demand over time. Highly efficient systems allocate just enough resource to meet current loads. Inefficient systems might be ripe for consolidation or technology refresh initiatives.

SYSTEM AGILITY - A KPI showing the variance in the efficient alignment of IT resources to workload over time. A high variance indicates low agility of the IT domains to respond to changing workload, likely because of inflexibly dedicated resources. Virtualized and dynamically re-balanced domains will have high agility scores.
"Data Center Efficiency" and "Data Center Agility" scores are determined by rolling up system efficiency and agility scores from IT domains. Internal response time numbers are rolled up to produce a "Data Center Effectiveness" rating. If these KPIs are carefully constructed to be on a 0 to 100 scale, then any business or IT manager can easily determine the state of IT.
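A hedged sketch of how such 0-to-100 scores might be computed follows. The scoring functions and the optimal-utilization goal here are hypothetical, standing in for the output of a real cross-domain performance model:

```python
# Sketch (hypothetical scoring scheme): deriving 0-100 efficiency and
# agility scores from utilization samples against a model-derived goal.
from statistics import mean, pstdev

def performance_index(utilization: float, optimal: float) -> float:
    """100 when a workload's resources sit exactly at the optimal level;
    lower the further utilization drifts above or below it."""
    return max(0.0, 100.0 * (1.0 - abs(utilization - optimal) / optimal))

def system_efficiency(samples: list[float], optimal: float) -> float:
    """Average alignment of allocated resources to demand over time."""
    return mean(performance_index(u, optimal) for u in samples)

def system_agility(samples: list[float], optimal: float) -> float:
    """High variance in alignment implies low agility (0-100 scale)."""
    scores = [performance_index(u, optimal) for u in samples]
    return max(0.0, 100.0 - pstdev(scores))

hourly_utilization = [0.55, 0.60, 0.72, 0.68, 0.30, 0.65]
goal = 0.65  # hypothetical model-derived optimal effective utilization
print(round(system_efficiency(hourly_utilization, goal), 1))
print(round(system_agility(hourly_utilization, goal), 1))
```

Rolling these per-domain scores up (for example, by weighted average) would yield the data-center-level ratings described above.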
Data center performance metrics have become useful in deriving Key Performance Indicators for managing IT Infrastructure from a business perspective. IT can now report service performance scores in:
1) Effectiveness in delivering service
2) Efficiency with respect to meeting performance requirements
3) Agility in responding dynamically to change
When IT has real numbers that can be taken back to the business to show how they are operating their infrastructure, they gain a tremendous amount of credibility. As IT and their business folk negotiate with this kind of information between them, it becomes possible to accurately assess past performance, make intelligent new IT investment decisions, and set realistic measurable goals.
With the right cross-domain performance management solution, the data center KPIs mentioned above are automatically created for both dedicated physical and dynamic virtualized architectures. This enables IT to manage infrastructure domains day-to-day to ensure operational delivery and reliability. Deviations from "normal" can be quickly flagged and acted upon. ITIL service support processes (e.g., incident and problem management) can be driven and managed "horizontally" across the whole set of IT domains.
Even more powerfully, IT management now has real metrics that can trigger, guide, and assess the results of projects designed to optimize infrastructure. Poor efficiency scores can be used to initiate consolidation efforts. Low agility scores can drive virtualization deployments. While raising these scores, infrastructure response times can be monitored to ensure that effective service delivery is maintained or even improved.
And perhaps most significantly, the models that produce these metrics can be used predictively to recommend future scenarios. For example, if the business is forecasting a growth campaign, IT can project what types of new investments it will need, and what the various investment scenarios will mean for IT service effectiveness and data center efficiency. Technology refresh requirements, vendor lease negotiations, and even outsourcing alternatives can be fairly evaluated for effect. ITIL service delivery processes (e.g., financial and capacity management), those that help align IT with the business, are now enabled with mathematically based decision-making information.
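A hedged sketch of such a projection, assuming a simple M/M/1 queuing relation (R = S / (1 − U)) and that utilization scales linearly with offered load (both simplifications a real model would refine):

```python
# Sketch: using a queuing model predictively. Given a business growth
# forecast, project the resulting infrastructure response time and flag
# when the projected load would saturate current capacity.

def projected_response_time(service_time_s: float,
                            current_utilization: float,
                            growth_factor: float) -> float:
    u = current_utilization * growth_factor  # assume load scales linearly
    if u >= 1.0:
        raise ValueError("projected load saturates the system; add capacity")
    return service_time_s / (1.0 - u)

# Hypothetical service: 10 ms of work per transaction, currently 60% busy
# (25 ms response time). The business forecasts 40% transaction growth:
print(projected_response_time(0.010, 0.60, 1.4) * 1000)  # 62.5 ms
```

If the projected response time breaches the service level, the same model can be rerun against candidate investment scenarios (more servers, faster storage) to compare their effect before any money is spent.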
Figure 2 (Application Total Response Time Forecast Chart): Models can produce future infrastructure response time for each application.
In all of these cases, before and after IT performance metrics can be presented to the business bringing IT visibility and control back into the boardroom. IT can better be part of the business value dialog in budgeting and prioritizing corporate investment to support growth and profitability.
Organizations with mature IT financial management processes can even use these resulting indices and scores to create believable dollar-value ROI analyses for various projects and scenarios.
Using IT data center effectiveness, efficiency, and agility scores enables managing IT as a business and provides a way to measure how well IT aligns with and serves the larger corporation. IT is back in business!