Five Keys to Success with APM in Production Environments – Continuous Monitoring (Part 1 of 5)

December 7, 2011 at 12:11 am

By Diego Lomanto (Twitter: diego_lomanto)

This is the first of a five-part series in which we explore the critical factors in implementing APM successfully in production environments. Please check back next week for part two.

If you are currently evaluating an Application Performance Management (APM) solution, you probably realize by now that there are several capabilities that must be included in order to maximize the value of APM. Gartner summed these up nicely in their recent Magic Quadrant report. Dynamically generated topology maps, application diagnostics, transaction monitoring, end user experience, and reporting capabilities have become the table stakes for APM these days. I talked a bit about using these dimensions to take a business transaction-driven approach to APM in my last post.

These dimensions are the baseline requirements when considering an APM solution. However, maximizing your APM investment in production hinges on five critical capabilities that can make or break an implementation and that don’t get as much coverage in the media.

Over the next five blog entries I’ll spend a little bit of time on each of these success factors so you can be sure that you purchase and deploy a solution that will deliver the results you expect not just in development and testing environments but also in production. Let’s start with continuous monitoring:

Part 1 - Continuous Monitoring, NOT Exception-Based Monitoring

The first entry in this series deals with the value of enabling a continuous monitoring solution rather than an exception-based one. Many APM solutions have trouble dealing with high-volume environments, so they function in a passive mode, tracking mostly high-level metrics and basic KPIs while waiting for a pre-defined exception to occur. Only then do they enter a more active monitoring mode. But those high-level tier metrics are not a reflection of transaction health and have little to do with the end-user experience.

On the other hand, continuous monitoring solutions were built from the ground up for low overhead so that they can run 24×7 on all transactions. We recommend a continuous approach in your production environment. Here’s the rationale:

The Risk in Production with Exception-Based Solutions

There are a few problems with exception-based solutions:

  • They do not surface problems you haven’t defined as a breach in advance. This is the main problem with exception-based solutions. If the administrators of the system have accurately planned for all of the breaches that might occur, then they might be able to get data on problems within the environment. But what if the breaches are not well-defined? You end up with blind spots. Everything looks fine because no red flags are getting reported. But is that the reality? How do you know if you can’t see everything?
  • Frequent smaller problems fall between the cracks because they occur sporadically and not consistently enough for the tool to treat them as an “exception”. However, all of these small problems often add up to a poor end-user experience. And even if such a problem does trigger the exception mechanism, what happens if it does not occur again while the exception-based tool is watching? Nothing gets reported.
  • Monitoring uncovers no problems because the issue has already occurred and the system has returned to its normal state. Then, as soon as the tool goes back to passive mode, the problems arise again, triggering the exception but yielding no meaningful data. You end up going around in circles and never truly resolving the problems.

What’s happening here is that exception-based solutions leave you with too many blind spots to manage application performance effectively.

The problem with exception-based monitoring solutions


Exception-based tools work this way in production to minimize their overhead and the amount of data that they capture. These tools were designed for helping developers debug their code, not for 24/7 production use, so they are not able to monitor and analyze millions of unique activities every day. They have to apply some sort of a selection mechanism to decide what to monitor and what can be ignored.
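To make the contrast concrete, here is a minimal Python sketch. It is not how any particular APM product works. It assumes a hypothetical stream of response-time samples, a hypothetical 2-second breach threshold, and a hypothetical rule that the exception-based collector captures detail only for the three transactions that follow a breach. Every name and number in it is illustrative.

```python
from dataclasses import dataclass

THRESHOLD_MS = 2000   # hypothetical pre-defined breach threshold
ACTIVE_WINDOW = 3     # hypothetical: detailed capture covers this many transactions after a breach


@dataclass
class Sample:
    name: str
    response_ms: int


def exception_based(samples):
    """Capture detail only while in active mode, which a breach switches on."""
    captured = []
    remaining = 0
    for s in samples:
        if remaining > 0:
            captured.append(s)
            remaining -= 1
        if s.response_ms > THRESHOLD_MS:
            # Breach seen: start watching the next few transactions.
            remaining = ACTIVE_WINDOW
    return captured


def continuous(samples):
    """Capture detail for every transaction."""
    return list(samples)


def slow(samples):
    """Return the samples that exceeded the threshold."""
    return [s for s in samples if s.response_ms > THRESHOLD_MS]


# A sporadic problem: every fifth checkout is slow, the rest are healthy.
samples = [Sample("checkout", 2500 if i % 5 == 0 else 300) for i in range(20)]

print("slow transactions captured, exception-based:", len(slow(exception_based(samples))))
print("slow transactions captured, continuous:", len(slow(continuous(samples))))
```

In this run the exception-based collector records 12 transactions and none of them is slow, because each breach is over before detailed capture begins. The continuous collector holds all four slow checkouts.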

How Does Continuous Monitoring Help?

To deal with all future problems you need to be able to see everything. You need to know what happened before the problem occurred and understand what’s happening right now. You need to know what is considered normal. Otherwise, how do you know what is abnormal? Sometimes the problem is simply not definable in advance and flies under the radar of exception-based solutions. For example, if an important database table gets deleted by accident, application performance might actually look to be improving, since requests that fail or find no data tend to return quickly. Exception-based solutions might not notice anything was wrong even though from the end users’ perspective all the data is gone. This is a full-blown application outage.
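The deleted-table example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response-time history, the 2-second threshold, and the three-standard-deviation tolerance are all hypothetical values, and a real product would use a far richer model of "normal". The point is that a baseline built from continuous data flags a deviation in either direction, while a pre-defined slow-response rule stays silent when things get suspiciously fast.

```python
from statistics import mean, stdev


def build_baseline(history_ms):
    """Summarize what 'normal' looks like from past response times."""
    return mean(history_ms), stdev(history_ms)


def is_abnormal(value_ms, baseline, tolerance=3.0):
    """Flag values far from the baseline in either direction.

    tolerance is a hypothetical cutoff measured in standard deviations.
    """
    avg, spread = baseline
    if spread == 0:
        return value_ms != avg
    return abs(value_ms - avg) / spread > tolerance


def breaches_threshold(value_ms, threshold_ms=2000):
    """A pre-defined slow-response rule, as an exception-based tool might use."""
    return value_ms > threshold_ms


# Hypothetical response times (ms) collected continuously on a normal day.
history = [480, 510, 495, 520, 505, 490, 515, 500]
baseline = build_baseline(history)

# Hypothetical response time after the table is gone: requests return almost instantly.
after_table_loss = 20

print("threshold rule fires:", breaches_threshold(after_table_loss))
print("baseline check fires:", is_abnormal(after_table_loss, baseline))
```

The threshold rule does not fire because 20 ms is well under the limit, while the baseline check does fire because 20 ms is nowhere near the usual range of roughly 480 to 520 ms.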

Here’s what an effective continuous monitoring solution will do for you:

  • Discover, classify, and track all business transactions across multiple tiers and components.
  • Identify the exact performance details at each step that the application executes in order to quickly isolate problems.
  • Alert IT staff to developing service disruptions and anomalies long before they are detected by end users.
  • Enable IT to proactively manage application performance and prevent service level degradation or interruptions to business services.
  • Monitor transactions that were not defined up front as “transactions of interest”.

The diagram below depicts a dynamically generated topology map from a continuous monitoring solution that has automatically, and without any input from systems administrators, detected the true architecture of the application environment – including tiers that may be unexpectedly part of the transaction flow.

Dynamically Generated Topology in an APM Solution


That’s a powerful capability that you can’t get with exception-based technology.  Another example of where exception-based monitoring would fail is the common situation of a batch job or some other nightly activity that accidentally got kicked off in the middle of the business day. Such nightly processes often hammer the databases as they perform complex calculations and produce detailed reports. When running in the middle of the day, they will slow down other transactions that are also trying to access the databases.

What would an exception-based solution do? At best, it will show that online transactions are slowing down, CPU and activity levels are high, and some systems may be running close to capacity. But it will not point to the offending batch job as the root cause, because batch jobs are not among the business activities that were defined up front for monitoring. The operations manager might conclude that it is time to upgrade the hardware (because it is getting close to capacity in the middle of the day) without realizing that the hardware is just fine and the real issue is a job scheduling error.
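The batch job scenario can be sketched the same way. The Python below is a hypothetical illustration, not a description of any product: the activity names, the database timings, and the list of "transactions of interest" are all invented. It totals database time per activity twice, once restricted to the activities defined up front and once over everything that actually ran.

```python
from collections import defaultdict

# Hypothetical set of business activities defined up front for monitoring.
TRANSACTIONS_OF_INTEREST = {"login", "search", "checkout"}

# Hypothetical (activity, database_time_ms) records for one midday interval.
records = [
    ("login", 40), ("search", 120), ("checkout", 200),
    ("nightly_report", 9000), ("search", 150), ("nightly_report", 8500),
    ("checkout", 260), ("login", 55),
]


def db_time_by_activity(records, only=None):
    """Total database time per activity; 'only' restricts the view to a predefined set."""
    totals = defaultdict(int)
    for activity, db_ms in records:
        if only is None or activity in only:
            totals[activity] += db_ms
    return dict(totals)


def top_consumer(totals):
    """Return the activity with the largest total database time."""
    return max(totals, key=totals.get)


predefined_view = db_time_by_activity(records, only=TRANSACTIONS_OF_INTEREST)
full_view = db_time_by_activity(records)

print("top consumer, predefined view:", top_consumer(predefined_view))
print("top consumer, full view:", top_consumer(full_view))
```

In the predefined view the heaviest database user appears to be checkout, and the report job is invisible. In the full view the misplaced nightly_report job accounts for 17,500 ms of database time and is clearly the top consumer.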

Those are just a few examples of the power of continuous monitoring in a production environment. For more, you can visit the OpTier site. What about you? Have you come across any other good examples of a continuous monitoring solution detecting problems that would have been missed by an exception-based methodology? I’d love to hear some of your stories.

I’ll be back next week with the second part of this series, on leveraging APM analytics to uncover root cause. If you’d like to be notified when the post goes up, subscribe to our feed, click on the Twitter button at the top of the page, or follow me at @diego_lomanto.

