One of the more overloaded concepts in the IT management world is the Service Level Agreement (SLA). There is the ITIL definition of SLA that most IT people generally accept, but it really doesn’t tell the full story of how SLAs are used in practice. If you ask 10 different IT organizations how they use SLAs, you are likely to get 10 different answers. I think it’s safe to say that every IT organization has at least one SLA of some sort. Usually there will be some kind of operational SLA that describes help desk response time.
More relevant to the world of Application Performance Monitoring (APM), there may also be application performance-based SLAs, such as "95% of traffic to the company website will have a sub-3-second response time when rendering the home page." If an SLA is important enough, IT and the business negotiate what's necessary to meet requirements and report against them on a recurring basis, perhaps as often as daily for organizations with particularly performance-critical applications. These reports are an essential element of any SLA. The calculations and reports themselves are not complicated, assuming you have monitoring tools capable of capturing the right data.
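To make that concrete, here is a minimal sketch of what the core calculation looks like. The threshold, target, and sample data are assumptions for illustration only, not output from any particular monitoring tool:

```python
def sla_compliance(response_times_sec, threshold_sec=3.0):
    """Return the fraction of requests at or under the response-time threshold."""
    if not response_times_sec:
        return None
    within = sum(1 for t in response_times_sec if t <= threshold_sec)
    return within / len(response_times_sec)

# Hypothetical page-load times for one reporting period
samples = [1.2, 0.8, 2.9, 4.5, 1.1, 3.2, 0.6]
compliance = sla_compliance(samples)
print(f"Home page SLA compliance: {compliance:.1%} (target: 95.0%)")
```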
It starts to get more complicated (and I find myself getting more interested) when application performance-based SLAs are used to make application support more proactive and avoid potential SLA violations. Ideally your APM solution will provide advance warning of a performance problem based on monitoring of your SLA criteria. This empowers you to address the issue before it crosses any important thresholds that would show up in the recurring SLA report. ITIL doesn't provide any guidance on how to proactively use SLA reporting to prevent SLA violations.
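One way to think about that early warning is a rolling window of recent requests checked against a warning threshold that sits above the actual SLA target. The sketch below is only an illustration of the idea; the class name, window size, and thresholds are all assumptions, not anything a specific APM product exposes:

```python
from collections import deque

class SlaEarlyWarning:
    """Warn when recent compliance trends toward the SLA target."""

    def __init__(self, window=500, threshold_sec=3.0,
                 sla_target=0.95, warn_target=0.97):
        self.samples = deque(maxlen=window)   # most recent response times
        self.threshold_sec = threshold_sec
        self.sla_target = sla_target
        self.warn_target = warn_target

    def record(self, response_time_sec):
        self.samples.append(response_time_sec)
        if len(self.samples) < self.samples.maxlen:
            return None                        # not enough data yet
        within = sum(1 for t in self.samples if t <= self.threshold_sec)
        compliance = within / len(self.samples)
        if compliance < self.sla_target:
            return f"VIOLATION: {compliance:.1%} is below the SLA target"
        if compliance < self.warn_target:
            return f"WARNING: {compliance:.1%} is trending toward the SLA target"
        return None
```

The point of the warning band is simply to buy time: the team hears about degradation while compliance is still above the contractual number, rather than reading about it in the next report.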
So what are we to do? Some general guidelines I have seen with Quest Foglight customers who consider themselves successful with SLAs are:
  • Start simple: Focus on a percentage of traffic with an acceptable level of response time for the application or web pages you wish to monitor. If you want to add more factors to the calculation later, that's fine, but start simple and build from there.
  • The end user perspective is critical: It does little good to know that the database is performing to expectations if it still takes 10 seconds to retrieve user account information. Monitoring user experience is key to delivering successful services on an ongoing basis.
  • Monitor for a while before enforcing: If you haven't used application performance SLAs in the past, take the time to establish a baseline of where your performance levels are today (see the baselining sketch after this list). It does no good to set an SLA of sub-3-second response time for 95% of traffic if 40% of today's traffic takes more than 5 seconds.
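Here is a rough baselining sketch for that last point. It assumes you already have a way to export recent response times from your monitoring tool; the percentile summary and the 3-second cutoff are placeholders, not recommended targets:

```python
import statistics

def baseline_report(response_times_sec):
    """Summarize current performance so an SLA target can be set realistically."""
    times = sorted(response_times_sec)
    pct = statistics.quantiles(times, n=100)   # 1st..99th percentiles
    over_3s = sum(1 for t in times if t > 3.0) / len(times)
    print(f"median: {pct[49]:.2f}s  p95: {pct[94]:.2f}s  "
          f"p99: {pct[98]:.2f}s  over 3s: {over_3s:.1%}")

# e.g. baseline_report(last_weeks_page_load_times)
```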

There are others as well, but these are good places to start.
I’m interested to hear how you’re using application performance SLAs. Are they strictly enforced (with penalties, etc.), or are they used for general tracking toward improvement goals?