Archive for December, 2011
Five Keys to Success with APM in Production Environments – Real-Time APM (Part 3 of 5)
By Diego Lomanto (Twitter: diego_lomanto)
This is the third of a five-part series where we explore the critical factors of implementing APM in production environments successfully. You can find parts one and two here. Please check back next week for part four.
In this series we are discussing how the Gartner Magic Quadrant provides a great starting point for implementing an APM solution. However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation, capabilities that don’t get as much coverage in the media. They are:
- Continuous monitoring, NOT exception-based monitoring
- APM analytics that enable you to become more proactive with application/transaction data
- Smart alerting and real-time analytics impact
- Broad platform support eliminating all blind spots in your monitoring strategy
- Enterprise readiness for growth and scalability
Part 3 – Real-time Application Performance Monitoring Analytics
In part two of our series, we explored the value of APM analytics to conduct historical analysis of application performance in the short and long term. Now let’s look at how critical it is to assess application performance in real-time production environments to ensure success.
Real-time APM analytics allow you to make quick decisions about applications to improve performance, while understanding and quantifying the business impact of your actions. A real-time analysis is typically triggered by a real-time alert from the APM solution or by a real-time dashboard within the solution, and is then followed up by a deep dive into the data using real-time OLAP tools.
Real-time Alerts
APM Alerts can come via e-mail, text or even through a dedicated smartphone application like the one pictured below.
Most APM solutions provide alerts. However, look for one that integrates a “Complex Event Processing” (CEP) engine into the alerting algorithm in production. Without CEP, your alerts from production environments often lack significance. CEP engines combine multiple events to generate an alert, giving you a thinking engine that can make correlations between different events and tell you something has gone wrong, even if it’s not apparent at first glance. Most application monitoring tools only generate alerts based on IT events, not business events.
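To make the idea of combining events concrete, here is a minimal Python sketch of a CEP-style rule. It is an illustration only and does not reflect how any particular APM product implements CEP. The event names (db_cpu_high, checkout_slow), the 60-second window and the correlate function are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str         # hypothetical event type, e.g. "db_cpu_high"
    timestamp: float  # seconds since an arbitrary reference point


def correlate(events, required, window_seconds=60.0):
    """Return True if every event name in `required` occurs within one time window."""
    required = set(required)
    ordered = sorted(events, key=lambda e: e.timestamp)
    for i, first in enumerate(ordered):
        seen = set()
        # Collect the required event types that fall inside the window
        # starting at `first`.
        for e in ordered[i:]:
            if e.timestamp - first.timestamp > window_seconds:
                break
            if e.name in required:
                seen.add(e.name)
        if seen == required:
            return True
    return False


# An IT event and a business-facing event close together raise one alert.
close_together = [Event("db_cpu_high", 0.0), Event("checkout_slow", 30.0)]
far_apart = [Event("db_cpu_high", 0.0), Event("checkout_slow", 120.0)]
alert_now = correlate(close_together, ["db_cpu_high", "checkout_slow"])
alert_later = correlate(far_apart, ["db_cpu_high", "checkout_slow"])
```

In this sketch neither event would mean much on its own; the alert fires only when both are seen in the same window, which is the kind of correlation a CEP engine is meant to provide.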
Real-time Dashboards
Dashboards monitor SLAs or other relevant key metrics that assess the health of your systems. Here is an example of an APM real-time dashboard in action:
Common Real-Time Triggers
Here is an example of how an alert can be constructed in an APM solution with a complex event processing engine. In this scenario, all the transactions are being analyzed and the APM solution triggers an alert when business process SLAs are breached. This is important because transaction SLAs are set at the technical system level, not at the business level. So while the individual transactions may not have breached their SLAs, the business process is suffering because of the poor performance of many transactions.
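The difference between transaction-level and business-process-level SLAs can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the transaction names, durations and SLA values are invented, and the business process duration is assumed to be the sum of its transactions' durations.

```python
def breached_slas(transactions, process_sla_seconds):
    """Check transaction SLAs and the overall business process SLA.

    transactions: list of (name, duration_seconds, transaction_sla_seconds).
    Returns (names of transactions over their own SLA, process SLA breached?).
    """
    transaction_breaches = [
        name for name, duration, sla in transactions if duration > sla
    ]
    # Assumption: the transactions run one after another, so the business
    # process takes as long as all of them combined.
    total = sum(duration for _, duration, _ in transactions)
    return transaction_breaches, total > process_sla_seconds


# Hypothetical order process: every step is within its own SLA...
steps = [
    ("login", 1.8, 2.0),
    ("search", 2.7, 3.0),
    ("checkout", 3.6, 4.0),
]
# ...but together they exceed a 7-second business process SLA.
tx_breaches, process_breached = breached_slas(steps, process_sla_seconds=7.0)
print(tx_breaches, process_breached)
```

Here no single transaction would raise a technical alert, yet the business process as a whole is too slow, which is the case a business-level trigger is meant to catch.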
Other common triggers:
- Database CPU spikes caused by users with older browsers
- Search responding slowly to one specific query that was not optimized
- Chatty transactions inhibiting cloud-based transaction performance
- Imbalanced cluster of middleware tiers
Real-Time OLAP Engines
Once an alert is triggered or a dashboard reveals a problem in real time, the next step is to slice and dice the data and look at it from different points of view to isolate where the problem is coming from. With real-time OLAP engines, you can view the transaction path, tier performance and end-user perspective, just to name a few, in order to isolate issues. This is the true power of real-time analytics in production environments. And it’s the difference between IT saying all KPIs are green and the business saying orders are falling off.
Some common real-time OLAP findings:
- Business process completed/failed/not completed
- Users were impacted by the current outage
- Locations were impacted by the current slowdown
- Unauthorized users accessing the application
- Topology changes
- Load balancing
- Slow databases
- High CPU resource consumption on specific tiers
- Web services provider underperforming
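The "slice and dice" step described in this section can be sketched as grouping the same transaction records by different dimensions. The Python below is only an illustration: a real OLAP engine works on far larger data with pre-built cubes, and the field names (tier, location, response_ms) and values are assumptions.

```python
from collections import defaultdict


def average_by(records, dimension, metric="response_ms"):
    """Average a metric over all records, grouped by one dimension."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for record in records:
        bucket = totals[record[dimension]]
        bucket[0] += record[metric]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical transaction measurements.
records = [
    {"tier": "web", "location": "NYC", "response_ms": 120},
    {"tier": "web", "location": "London", "response_ms": 130},
    {"tier": "database", "location": "NYC", "response_ms": 900},
    {"tier": "database", "location": "London", "response_ms": 1100},
]

by_tier = average_by(records, "tier")          # points at the database tier
by_location = average_by(records, "location")  # same data, another view
```

Viewing the same data by tier shows a clear outlier, while the view by location shows both sites affected, which is the kind of contrast that helps isolate where a problem originates.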
Tying it all Together
Armed with this knowledge, you can now go solve the problem. You can even assign business impact to the real-time analysis to prioritize your actions. For example, through real-time OLAP you can discover that two transactions are beginning to fail. One of the transactions is responsible for $1M a minute in revenue, and the other is worth $10k. It’s now easy to figure out what to solve first.
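As a rough sketch of assigning business impact, the Python below ranks failing transactions by estimated revenue at risk. The revenue figures mirror the example above (assuming the $10k is also per minute), while the transaction names and failure rates are invented for illustration.

```python
def prioritize(failing):
    """Sort failing transactions by estimated revenue at risk per minute."""
    return sorted(
        failing,
        key=lambda t: t["revenue_per_minute"] * t["failure_rate"],
        reverse=True,
    )


# Hypothetical names and failure rates; revenue figures follow the example.
failing = [
    {"name": "small_order", "revenue_per_minute": 10_000, "failure_rate": 0.50},
    {"name": "big_order", "revenue_per_minute": 1_000_000, "failure_rate": 0.02},
]

for transaction in prioritize(failing):
    print(transaction["name"])
```

Even with a much lower failure rate, the high-revenue transaction comes first in this sketch because more money is at stake each minute.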
This type of analysis is crucial to operating an APM solution in a production environment effectively. With so much data available to analysts, it is imperative not only that they can make sense of it, but also that they have access to the information as problems arise. Through a combination of alerts, dashboards and OLAP engines, users can monitor their infrastructure proactively.
Next week we’ll discuss broad platform support in part four of the series.
Saddle-up, Las Vegas!
By Linh C. Ho
Follow on Twitter: linh_ho_nyc
Another year is ending with the annual Gartner Data Center show in Las Vegas! This show has grown quite a bit since I first attended almost a decade ago. I believe over 2500 attendees this year roamed around Caesars Palace with name badges hanging on the belly, some camouflaged among the fans of the National Finals Rodeo.
This year, cloud computing was not the only hot topic; there also seemed to be a lot of discussion around DevOps and analytics. Other topics of interest included Application Performance Monitoring (APM), End-user Experience (EUE), Business Transaction Management (BTM), Big Data and many more around IT operations.
Take a gander (something like that) at the few bits and bites I picked up. I found the polling questions particularly interesting to share.
Poll: What is the biggest reason the IT Ops group isn’t doing more innovation?
- 41% too busy with day to day operations
- 27% politics
- 11% cultural
- 9% feel they are very innovative
- 7% not a priority
- None or others
Not surprising. When IT is too busy with day-to-day operations like keeping the lights on, and expensive resources are stuck on a conference bridge figuring out who is accountable for fixing the problem, of course no time or resources are put into innovation. Only when IT is proactive and prevents issues does innovation get a spot on the agenda. Still, 9% feel very innovative (innuh-vaaa-div, as they would say at the rodeo) :-). I’d be interested in hearing from this 9% to understand what they are working on! Innovation clearly brings new ideas that drive change and create value, which can only enable better business outcomes. If you’ve had a positive change in business outcome due to innovation, please do share!
Poll: What is your top priority for availability and performance tool investment for the coming budget year?
- 26% APM
- 21% ECA
- 14% SLA
- 7% Virtualization
- 7% Server Monitoring
- 5% Network Monitoring
- 2% BSM
- 2% Cloud
Indeed APM is hot again, though this refers to ‘new APM’, which goes above and beyond the traditional deep-dive tools. One of the analysts cautioned about using last millennium’s tools to solve today’s problems! Deep-dive is only one slice of the five dimensions of APM according to Gartner. Gartner did publish a 2011 APM Magic Quadrant in which seven vendors were positioned in the leaders’ quadrant (and met all five dimensions): OpTier, CA, Compuware, HP, IBM, Opnet and Quest. If you’re in the 26%, here’s a complimentary copy of the APM report.
According to the analysts, APM inquiries seem to have increased by over 50% compared to last year. Within APM, end-user experience and business transaction profiling are touted as the two hottest topics.
Event Correlation Analysis (ECA) is second to APM. Beyond ECA, but somewhat related, we see customers looking to apply analytics to both IT and business operations. Approaches such as multi-dimensional OLAP (online analytical processing), CEP (complex event processing) and log analysis are commonly seen. Log parsing and analysis comes up primarily for those looking to parse log files to assist with troubleshooting. Bringing in intelligence through the likes of CEP helps IT raise its awareness of business impact, prioritize, and prevent abnormal behaviors in both IT and business operations. Multi-dimensional OLAP helps bring different perspectives easily and quickly for problem isolation, impact assessment, resolution, optimization and more. For example, one can view service levels by business transactions, by users, or by applications, or flip it around to get resource consumption by applications, users, and transactions. Think of a Rubik’s Cube for IT management.

These all borrow business intelligence concepts for the world of IT management – which is not a bad thing when the ultimate goal is aligning IT and the business.
I am surprised to see only 2% for cloud for the coming year. Perhaps the audience wasn’t sure what Gartner meant by ‘cloud’, as it could cover very different kinds of initiatives. Or, simply, the audience still isn’t quite ready. Lastly, I can’t say I am surprised to see 2% for BSM (Business Service Management); this term has been nebulous for quite some time, and since the last pure-play BSM vendor was picked up by Novell, we’re not hearing much from that corner.
DevOps surfaced quite a bit; here are a couple of interesting polls.
Poll: Is your organization leveraging DevOps?
- 62% have not heard of DevOps before
- 11% aware of DevOps but not planning to use it
- 9% experimenting – not in production
- 8% using for both critical and non-critical apps
- 7% considering using DevOps in next 12-14 months
- 3% using it for non critical apps
This was a surprise: 62% have not heard of DevOps? In its defense, when it comes to bridging the gap between dev and ops, we’re just not there yet. The reality is more “OpsDev”: how to help the IT operations guys bring factual data to the dev guys to fix issues that are causing pain in production. IT operations needs to be proactive about crossing over that wall. This can only improve productivity and communication and eliminate the traditional siloed approach. There is still some work to do here to break down the great wall.

Poll: What process is most in need of being addressed via DevOps?
- 36% Release management
- 35% Change management
- 12% performance management
- 12% capacity management
- Others
Not shocking. Change and release management are key processes to address via DevOps. How often do changes cause performance problems? Do you understand the impact of a change on your end-user experience, critical business transactions and application performance? Effective change and release processes can only be achieved with solid collaboration between dev and ops to minimize application rollbacks, improve release quality and reduce the risk of impacting performance.
Finally, kudos to the Gartner analysts for the informative sessions, one-on-ones, dinners and drinks. They’ve gone the extra mile for attendees and vendors! Usually Gartner will have a write-up on the Data Center poll results the following spring. Buckle up! It will be interesting to see what they make of all this!

Until then, have a safe and happy holiday season!
Five Keys to Success with APM in Production Environments – APM Analytics (Part 2 of 5)
By Diego Lomanto (Twitter: diego_lomanto)
This is the second of a five-part series where we explore the critical factors of implementing APM in production environments successfully. You can find part one here. Please check back next week for part three.
In this series we are discussing how the Gartner Magic Quadrant provides a great starting point for implementing an APM solution. However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation, capabilities that don’t get as much coverage in the media. They are:
- Continuous monitoring, NOT exception-based monitoring
- APM analytics that enable you to become more proactive with application/transaction data
- Real-time monitoring for proactive APM analysis
- Broad platform support eliminating all blind spots in your monitoring strategy
- Enterprise readiness for growth and scalability
Over these five blog entries I’ll spend a little bit of time on each of these success factors so you can be sure that you purchase a solution that will deliver the results you expect, not just in development and testing environments but also in production.
Part 2 – APM Analytics
Last week we talked about the virtues of a continuous monitoring strategy at length. But now that we can see everything, we’re going to have to find a way to make sense of it. A major risk for APM solutions in production environments is that they simply overwhelm the end user with data, or the opposite occurs: they don’t provide enough actionable intelligence. It’s just hard to manually determine what’s important.
This is where APM analytics comes into play. Analytics should not be an optional component of APM – it is vital to fulfill the promise of APM. It enables you to analyze application performance in ways previously impossible or requiring massive amounts of work. And, analytics makes APM accessible to the enterprise.
Types of Analysis
The easiest way to understand APM analytics is to look at the use cases. The common use cases of analytics for APM are real-time analysis, short-term analysis and long-term planning.
Real-time (e.g. A product server is about to violate an SLA)
- Real-time OLAP
- Alerting to isolate problems while they are happening in transactions, infrastructure and business process.
In the Short-term (Why were transactions 10% slower today?)
- Business event correlation for root cause analysis
- Capacity management
In the Long-term (What applications can we move to the cloud?)
- Improving user and application behavior
- Capacity planning
- Cloud architecture planning
Here are some of the common types of reports you will get:
Real-time analytics is a topic that deserves its own post, so I’ll cover that next week in detail. In this post we’ll focus on the short and long term use cases.
What Makes APM Analytics Work in Production?
Ok, sounds good so far, right? There’s a gotcha. You knew there would be! In order to provide the right amount of actionable intelligence in a production environment, you must first start with good data. The concept of “garbage in, garbage out” holds very much true for APM analytics. Here’s the secret to good APM data: entity relationships.
Entity relationships hold information about the interaction of a transaction with other components of the infrastructure (e.g. this transaction was in this tier for this long before moving to that tier). Entity relationships are crucial to APM analytics because they allow you to infer root cause. Most APM solutions cannot provide detailed entity relationships in production because they do not track all the tiers and they do not track all transactions. This all goes back to the continuous monitoring requirement from last week. You might be starting to see that the keys to success with APM in production are related to each other.
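One way to picture an entity relationship is as a record of each hop a transaction makes between tiers. The Python sketch below is a simplified illustration; the Hop structure, its field names and the sample values are assumptions, not the data model of any specific APM product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hop:
    transaction_id: str
    tier: str                 # tier the transaction was in
    next_tier: Optional[str]  # tier it moved to, or None at the end of the path
    seconds_in_tier: float    # how long it stayed in `tier`


def slowest_tier(hops, transaction_id):
    """Return the tier where the given transaction spent the most time."""
    own = [h for h in hops if h.transaction_id == transaction_id]
    return max(own, key=lambda h: h.seconds_in_tier).tier


# Hypothetical path of one transaction through three tiers.
hops = [
    Hop("tx-1", "web", "app", 0.05),
    Hop("tx-1", "app", "database", 0.20),
    Hop("tx-1", "database", None, 2.40),
]
suspect = slowest_tier(hops, "tx-1")
```

Because each hop records both where the transaction was and where it went next, the same records can be used to rebuild the full path and to point at the tier that deserves a closer look.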
Ok, Sounds Good. How About Some Examples?
Sure thing. At OpTier, we call the customizable part of our APM analytics Business Events and we’ve helped customers use it to detect the following:
- Poorly designed SQLs as the root-cause of slow transactions
- ESB wrongly orchestrating transactions
- Retail banking payment transactions traversing certain application components before final booking
- Trading transactions having specific cut-off times in the day
- Order fallouts (common for telcos)
- Resource-intensive batch tasks impacting online transaction activities
- Specific users impacting system performance
Let’s take a deeper dive. Here’s an example of APM analytics uncovering the root cause of where transactions are failing in the short term. In the screenshot below we are analyzing transaction flows and can see that there is a missing step in the overall process flow: “Send Invoice.”
The APM solution can detect and report “Send Invoice” as a root cause because of the entity relationships. There are relationships between tiers in a transaction flow, and when the system can understand that it can start to detect when those relationships start to change. The next step here is for an analyst to look at the invoicing system and determine why that step in the process is not occurring. This improves mean time to resolution, as the analyst is not forced to look at every tier, just the problematic ones. He or she can then get the issue over to developers to fix faster than they could have before APM analytics was available in production.
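Detecting a missing step like this can be sketched as comparing the flow a transaction is expected to follow with the flow that was actually observed. In the Python below, only “Send Invoice” comes from the example above; the other step names and the missing_steps function are hypothetical.

```python
def missing_steps(expected_flow, observed_flow):
    """Return the expected steps that never appeared in the observed flow."""
    observed = set(observed_flow)
    return [step for step in expected_flow if step not in observed]


# Hypothetical order process; in practice the expected flow would be learned
# from the relationships seen across many healthy transactions.
expected = ["Create Order", "Check Inventory", "Send Invoice", "Ship Order"]
observed = ["Create Order", "Check Inventory", "Ship Order"]

gaps = missing_steps(expected, observed)
```

An analyst who sees a single missing step can go straight to the system behind it instead of inspecting every tier in the flow.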
That is just one example of the power of APM analytics in a production environment. Because of the depth of information, APM analysts need a way to parse through the volume to get to the root causes. Analytics is the key to delivering this success in production.
What do you think? Have you come across any other good examples of analytics? I’d love to hear some of your stories.
Stay tuned for the third part of this series, where we will discuss leveraging real-time analysis to proactively monitor applications. If you’d like to be notified when the post goes up, please follow me on Twitter @diego_lomanto.
Five Keys to Success with APM in Production Environments – Continuous Monitoring (Part 1 of 5)
By Diego Lomanto (Twitter: diego_lomanto)
This is the first of a five-part series where we explore the critical factors of implementing APM in production environments successfully. Please check back next week for part two.
If you are currently evaluating an Application Performance Management (APM) solution you probably realize by now there are several capabilities that must be included in order to maximize the value of APM. Gartner summed these up nicely in their recent magic quadrant report. Dynamically generated topology maps, application diagnostics, transaction monitoring, end user experience, and reporting capabilities have become the table stakes for APM these days. I talked a bit about using these dimensions to take a business transaction-driven approach to APM in my last post.
These dimensions are the baseline requirements when considering an APM solution. However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation, capabilities that don’t get as much coverage in the media. They are:
- Continuous monitoring, NOT exception-based monitoring
- APM analytics that enable you to become more proactive with application/transaction data
- Real-time monitoring for proactive APM analysis
- Broad platform support eliminating all blind spots in your monitoring strategy
- Enterprise readiness for growth and scalability
Over the next five blog entries I’ll spend a little bit of time on each of these success factors so you can be sure that you purchase and deploy a solution that will deliver the results you expect not just in development and testing environments but also in production. Let’s start with continuous monitoring:
Part 1 - Continuous Monitoring, NOT Exception-Based Monitoring
The first entry in this series deals with the value of enabling a continuous monitoring solution rather than an exception-based one. Many APM solutions have trouble dealing with high-volume environments, so they function in a passive mode, tracking mostly high-level metrics and basic KPIs, waiting for a pre-defined exception to occur. Only then is a more active monitoring mode entered. Tier metrics are not a reflection of transaction health and have little to do with the end-user experience.
On the other hand, continuous monitoring solutions were built from the ground up with low overhead so that they could run 24×7 on all transactions. We recommend a continuous approach in your production environment. Here’s the rationale:
The Risk in Production with Exception-Based Solutions
There are a few problems with exception-based solutions:
- Does not surface problems you haven’t defined as a breach in advance. This is the main problem with an exception-based solution. If the administrators of the system have accurately planned for all of the breaches that might occur, then you might be able to get data on problems within the environment. But what if the breaches are not well-defined? You end up with blind spots. Everything looks fine because no red flags are getting reported. But is that the reality? How do you know if you can’t see everything?
- Frequent smaller problems fall between the cracks because they occur sporadically and not consistently enough for the tool to decide that they are an “exception”. However, all of these small problems often add up to a poor end-user experience. And even if such breaches do trigger the exception mechanism, what happens if they do not occur again while the exception-based tool is watching? Nothing gets reported.
- Monitoring uncovers no problems because the issue has already occurred and the system has returned to its normal state. And as soon as the tool goes back to passive mode the problems arise again, triggering the exception but yielding no meaningful data. You end up going around in circles and never truly resolving the problems.
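The second problem in the list, sporadic issues that never look like an exception, can be illustrated with a small Python sketch. The rule below (at least five slow transactions within 60 seconds) is a made-up exception rule, not one taken from a real tool, and the timestamps are synthetic.

```python
def exception_triggered(slow_timestamps, window_seconds=60, min_count=5):
    """Hypothetical exception rule: fire only when `min_count` slow
    transactions fall within a single window of `window_seconds`."""
    ordered = sorted(slow_timestamps)
    for i in range(len(ordered) - min_count + 1):
        if ordered[i + min_count - 1] - ordered[i] <= window_seconds:
            return True
    return False


# Synthetic data: one slow transaction every 10 minutes for 24 hours,
# with timestamps in seconds since midnight.
slow = [minute * 60 for minute in range(0, 24 * 60, 10)]

fired = exception_triggered(slow)  # the rule never sees a cluster
total_slow = len(slow)             # continuous tracking still counts them all
```

In this sketch the exception rule stays silent all day, while a tool that records every transaction would show well over a hundred slow experiences for end users.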
What’s happening here is that exception-based solutions leave you with too many blind spots to manage application performance effectively.
Exception-based tools work this way in production to minimize their overhead and the amount of data that they capture. These tools were designed for helping developers debug their code, not for 24/7 production use, so they are not able to monitor and analyze millions of unique activities every day. They have to apply some sort of a selection mechanism to decide what to monitor and what can be ignored.
How Does Continuous Monitoring Help?
To deal with all future problems you need to be able to see everything. You need to know what happened before the problem occurred and understand what’s happening right now. You need to know what is considered normal. Otherwise, how do you know what is abnormal? Sometimes the problem is simply not definable in advance and flies under the radar of exception-based solutions. For example, if an important database table gets deleted by accident, application performance might actually look to be improving. Exception-based solutions might not notice anything was wrong even though from the end users’ perspective all the data is gone. This is a full-blown application outage.
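Knowing what is normal can be sketched as keeping a baseline and flagging values that deviate from it in either direction. The Python below uses a simple standard-deviation test as a stand-in; it is an assumption for illustration, not the method any specific product uses, and the response times are invented.

```python
import statistics


def is_abnormal(baseline, value, threshold=3.0):
    """Flag a value that is far from the baseline mean, whether higher or lower."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical response times (ms) for a query during normal operation.
baseline = [200, 210, 190, 205, 195]

# After a table is deleted by accident, the query returns almost instantly.
# A "slower than X" exception would never fire, but the baseline check does.
suspiciously_fast = is_abnormal(baseline, 20)
ordinary = is_abnormal(baseline, 204)
```

Checking in both directions matters here: a response that is suddenly much faster than normal is treated as just as suspicious as one that is much slower.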
Here’s what an effective continuous monitoring solution will do for you:
- Discover, classify and track all business transactions across multiple tiers and components.
- Identify the exact performance details at each step that the application executes in order to quickly isolate problems.
- Alert IT staff to developing service disruptions and anomalies long before they are detected by end users.
- Enable IT to proactively manage application performance and prevent service level degradation or interruptions to business services.
- Monitor transactions that had not been defined up-front as “transactions of interest”.
The diagram below depicts a dynamically generated topology map from a continuous monitoring solution that has automatically, and without any input from systems administrators, detected the true architecture of the application environment – including tiers that may be unexpectedly part of the transaction flow.
That’s a powerful capability that you can’t get with exception-based technology. Another example of where exception-based monitoring would fail is the common situation of a batch job or some other nightly activity that accidentally got kicked off in the middle of the business day. Such nightly processes often hammer the databases as they perform complex calculations and produce detailed reports. When running in the middle of the day, they will slow down other transactions that are also trying to access the databases.
What would an exception-based solution do? At best, it will show that online transactions are slowing down, CPU and activity levels are high, and some systems may be running close to capacity, but it will not point to the offending batch job as the root-cause because batch jobs are not among the business activities that had been defined upfront for monitoring. The Operations manager might conclude that it is time to upgrade the hardware (because it is getting close to capacity in the middle of the day) without realizing that the hardware is just fine and the real issue has to do with a job scheduling error.
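If batch jobs are tracked alongside online transactions, spotting one that runs outside its schedule becomes a simple check. The Python sketch below assumes a nightly window of 22:00 to 05:00 and a helper named in_window; both are hypothetical.

```python
from datetime import time


def in_window(start, window_start, window_end):
    """Return True if `start` falls inside the allowed window.

    Handles windows that wrap past midnight, e.g. 22:00 to 05:00.
    """
    if window_start <= window_end:
        return window_start <= start < window_end
    return start >= window_start or start < window_end


# Assumed nightly batch window.
NIGHT_START, NIGHT_END = time(22, 0), time(5, 0)

on_schedule = in_window(time(23, 30), NIGHT_START, NIGHT_END)
off_schedule = in_window(time(13, 15), NIGHT_START, NIGHT_END)
```

A batch job that starts at 13:15 fails the check, which points the investigation at job scheduling rather than at hardware capacity.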
Those are just a few examples of the power of continuous monitoring in a production environment. For more you can visit the OpTier site. What about you? Have you come across any other good examples of a continuous monitoring solution detecting problems that would have been missed by an exception-based methodology? I’d love to hear some of your stories.
I’ll be back next week to discuss leveraging APM analytics to uncover root cause for the second part of this series. If you’d like to be notified when the post goes up, subscribe to our feed, click on the Twitter button at the top of the page, or follow me @diego_lomanto.






