Author Archive
Five Keys to Success with APM in Production Environments – Enterprise Scale and Readiness (Part 5 of 5)
By Diego Lomanto (Twitter: diego_lomanto)
Welcome to the concluding post in our series on the Five Keys to Success with APM in Production Environments. In this series we have been discussing how the Gartner Magic Quadrant provides a great start to implementing an APM solution. However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation. To refresh your memory, so far we’ve covered:
- Continuous monitoring, NOT exception-based monitoring
- APM analytics that enable you to become more proactive with application/transaction data
- Smart alerting and real-time analytics impact
- Broad platform support to eliminate blind spots in your monitoring strategy
Now, let’s talk about enterprise scale and readiness.
Enterprise Scale and Readiness
If you can check off everything above – you’re in good shape. But there’s one more factor to success with APM in production. When you roll out your investment, will it be able to handle the scale of your business today and when it grows? If you are investing in APM, then you probably have a high volume of critical transactions to analyze. Your solution must be able to handle hundreds of millions of transactions per day. It cannot fail just as transactions sharply rise. These are precisely the times you need APM! It goes back to the whole continuous monitoring vs. exception-based monitoring argument: make sure the solution does not fail at crunch time, or work around its limitations by restricting the scale of deployment.
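To put "hundreds of millions of transactions per day" into per-second terms, here is a quick back-of-the-envelope sketch. The daily volume of 300 million and the peak-to-average ratio of 3 are illustrative assumptions, not figures from any particular deployment.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_tps(transactions_per_day):
    """Average transactions per second over a full day."""
    return transactions_per_day / SECONDS_PER_DAY


def peak_tps(transactions_per_day, peak_factor):
    """Rough peak rate, given an assumed peak-to-average ratio."""
    return average_tps(transactions_per_day) * peak_factor


daily_volume = 300_000_000   # illustrative "hundreds of millions" per day
assumed_peak_factor = 3      # hypothetical ratio; measure your own traffic

print(f"average: {average_tps(daily_volume):,.0f} tps")
print(f"peak:    {peak_tps(daily_volume, assumed_peak_factor):,.0f} tps")
```

The point of the exercise is that the monitoring solution has to be sized for the peak rate, not the daily average, because the peak is exactly when you need the data.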
You must also be sure that the solution itself can scale easily as your environment grows. IT is fluid, and APM must be as well. It’s not just about the number of transactions either. Determine whether it’s easy to add tracking on new tiers, or whether that is a whole project unto itself. Make sure your vendor provides a path to application monitoring expansion that is manageable.
Speaking of manageability, some other key requirements are that the solution should not have a single point of failure, should be remotely configurable and should be highly available. In addition, whatever method of tracking is employed should have low overhead so that it doesn’t crash once you put it into production. This is typically a problem for exception-based solutions, so keep your eyes on that.
Now, You are Prepared
Ok, now you have a better handle on the ins and outs of APM in production. It’s important that you pick a product that not only includes the five dimensions of APM, but also can run effectively in production and provide the right information to you at the right time. If you look for these requirements during the research phase, you’ll have a much better experience during roll out. And your solution will provide much more value, both for IT and the enterprise.
Thanks for tuning in during this series. It’s been great walking through these critical success factors and judging by the traffic it’s helped a few people. If you have any comments, I would love to hear them in the comment box below, or on twitter @diego_lomanto.
Five Keys to Success with APM in Production Environments – Broad Platform Support (Part 4 of 5)
By Diego Lomanto (Twitter: diego_lomanto)
This is the fourth of a five part series where we explore the critical factors of implementing APM in production environments successfully. You can find parts one, two and three here. Please check back next week for part five.
In this series we are discussing how the Gartner Magic Quadrant provides a great start to implementing an APM solution. However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation. Capabilities that don’t get as much coverage in the media. They are:
- Continuous monitoring, NOT exception-based monitoring
- APM analytics that enable you to become more proactive with application/transaction data
- Smart alerting and real-time analytics impact
- Broad platform support to eliminate blind spots in your monitoring strategy
- Enterprise readiness for growth and scalability
Part 4 – Broad Platform Support to Eliminate Blind Spots in your Monitoring Strategy
Your APM solution must be adaptable to support everything you have within your environment. This includes support not just for application processing tiers, but also databases and middleware, as they are often the cause of performance problems. A complete APM solution should support a diverse mix of SOA, private and public cloud, middleware, databases, homegrown, legacy and proprietary technology stacks. A solution that will thrive in production should also support applications of unknown design or with no access to code. This provides IT with visibility into the performance of 3rd-party applications and components and reduces the time to identify performance issues associated with them.
At the heart of broad platform support is the capability to track each and every transaction instance through its entire life cycle. To enable this, there are multiple techniques to track transactions across virtually any application and environment. Some of these innovative technologies include:
Active Context Tracking - ACT technology uses lightweight agents to track cross-tier context and the resource utilization of each transaction across the entire datacenter.
APIs – C, C++, C#, or Java APIs for monitoring transaction flow through proprietary and homegrown components.
Network packet capture – Non-intrusive agents capture all network traffic to identify and measure transactions from the network perspective.
Real-time log parsing – For platforms that are not optimal for instrumentation, real-time log file analysis is used to identify information about transaction flow. The log analysis is seamlessly integrated into the transaction model with no overhead.
Passive Context Tracking - When a transaction crosses a component that cannot be monitored (for example, a security appliance that cannot be instrumented and does not have logs), passive context tracking is used to stitch together the flow using an ID common to the different parts of the transaction.
When a transaction enters the system, it is identified by one of the agents and then undergoes classification and analytical processing. An effective production-ready APM solution continues to track the transaction as it traverses web, application, middleware, and database tiers, while collecting performance and resource consumption metrics at each tier. Even when a transaction makes a call to a tier that isn’t monitored, metrics such as the number of calls and the time spent on that tier are captured.
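To make the stitching idea concrete, here is a minimal sketch of joining per-tier observations into one end-to-end flow using an ID common to all parts of the transaction. The `TierEvent` record, its field names and the sample data are hypothetical and only for illustration; this is not any vendor's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TierEvent:
    """One observation of a transaction at a single tier (hypothetical schema)."""
    correlation_id: str   # ID shared by every segment of one transaction
    tier: str             # e.g. "web", "app", "db"
    start_ms: int         # when the transaction entered the tier
    duration_ms: int      # time spent in the tier


def stitch(events):
    """Group events by correlation ID and order each group by start time."""
    flows = defaultdict(list)
    for event in events:
        flows[event.correlation_id].append(event)
    return {
        cid: sorted(group, key=lambda e: e.start_ms)
        for cid, group in flows.items()
    }


def summarize(flow):
    """Return the tier path and the time spent per tier for one flow."""
    path = [e.tier for e in flow]
    time_per_tier = defaultdict(int)
    for e in flow:
        time_per_tier[e.tier] += e.duration_ms
    return path, dict(time_per_tier)


# Events arrive out of order and interleaved across transactions.
events = [
    TierEvent("tx-1", "db", 130, 40),
    TierEvent("tx-1", "web", 100, 10),
    TierEvent("tx-2", "web", 105, 12),
    TierEvent("tx-1", "app", 110, 20),
]
flows = stitch(events)
path, timing = summarize(flows["tx-1"])
print(path, timing)
```

The same grouping works whether the events come from agents, log parsing or packet capture, as long as each source can expose the shared ID.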
As we discussed in parts one through three, success in production is contingent upon continuously monitoring as much as possible to derive intelligence right when it’s needed. The capabilities listed above, available out of the box, are an important part of the always-on discovery and tracking approach to APM. They matter for both the speed of deployment and the value provided by the APM solution. If you have wide gaps in coverage, you simply miss too much and have too many blind spots in your monitoring capabilities.
So, that’s the fourth critical requirement. Coming up next, and closing out the series, is enterprise scalability.
Five Keys to Success with APM in Production Environments – Real-Time APM (Part 3 of 5)
By Diego Lomanto (Twitter: diego_lomanto)
This is the third of a five part series where we explore the critical factors of implementing APM in production environments successfully. You can find parts one and two here. Please check back next week for part four.
In this series we are discussing how the Gartner Magic Quadrant provides a great start to implementing an APM solution. However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation. Capabilities that don’t get as much coverage in the media. They are:
- Continuous monitoring, NOT exception-based monitoring
- APM analytics that enable you to become more proactive with application/transaction data
- Smart alerting and real-time analytics impact
- Broad platform support eliminating all blind spots in your monitoring strategy
- Enterprise readiness for growth and scalability
Part 3 – Real-time Application Performance Monitoring Analytics
In part two of our series, we explored the value of APM analytics to conduct historical analysis of application performance in the short and long-term. Now let’s look at how critical it is to assess application performance in real-time production environments to ensure success.
Real-time APM analytics allows you to make quick decisions about applications to improve performance, while understanding and quantifying the business impact of your actions. A real-time analysis is typically triggered by a real-time alert from the APM solution, or a real-time dashboard within the solution, and is then followed up by a deep-dive into the data using real-time OLAP tools.
Real-time Alerts
APM Alerts can come via e-mail, text or even through a dedicated smartphone application like the one pictured below.
Most APM solutions provide alerts. However, look for one that integrates a “Complex Events Processing” (CEP) engine into the alerting algorithm in production. Without CEP, your alerts from production environments are often without significance. CEP engines combine multiple events to generate an alert, giving you a thinking engine that can make correlations between different events and tell you something has gone wrong – even if it’s not apparent at first glance. Most application monitoring tools only generate alerts based on IT events, not business events.
Real-time Dashboards
Dashboards monitor SLAs or other relevant key metrics that assess the health of your systems. Here is an example of an APM real-time dashboard in action:
Common Real-Time Triggers
Here is an example of how an alert can be constructed in an APM solution with a complex events processing engine. In this scenario, all the transactions are being analyzed and the APM solution triggers an alert when business process SLAs are breached. This is important because transaction SLAs are set at the technical system level, not at the business level. So while the transactions may not have breached their SLA, the business process is suffering because of poor performance of many transactions.
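A simplified way to picture this scenario in code: every transaction stays inside its own technical SLA, yet the business process as a whole runs over. This is a hand-rolled rule, not a real CEP engine, and the transaction names, timings and SLA values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransactionTiming:
    name: str
    duration_ms: int
    sla_ms: int        # technical, per-transaction SLA


def breached_transactions(timings):
    """Names of transactions that exceeded their own technical SLA."""
    return [t.name for t in timings if t.duration_ms > t.sla_ms]


def business_process_breached(timings, process_sla_ms):
    """True when the combined duration exceeds the business-level SLA."""
    return sum(t.duration_ms for t in timings) > process_sla_ms


# Hypothetical "place an order" process made of three transactions.
order_process = [
    TransactionTiming("login", 900, 1000),
    TransactionTiming("add_to_cart", 950, 1000),
    TransactionTiming("checkout", 1900, 2000),
]
PROCESS_SLA_MS = 3500  # assumed business-level target for the whole process

individual = breached_transactions(order_process)
process_alert = business_process_breached(order_process, PROCESS_SLA_MS)
print("transaction breaches:", individual)
print("business process alert:", process_alert)
```

An alerting rule that looks only at the per-transaction list stays quiet here, while a rule that combines the events sees the breach.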
Other common triggers:
- Database CPU spikes caused by users with older browsers
- Search responding slowly to one specific query that was not optimized
- Chatty transactions inhibiting cloud-based transaction performance
- Imbalanced cluster of middleware tiers
Real-Time OLAP Engines
Once an alert is triggered or a dashboard reveals a problem in real-time, the next step is to slice and dice and look at the data from different points of view to isolate where the problem is coming from in real time. With real-time OLAP engines, you can view the transaction path, tier performance and end-user perspective, just to name a few, in order to isolate issues. This is the true power of real-time analytics in production environments. And, it’s the difference between IT saying all KPIs are green and the business saying orders are falling off.
Some common real-time OLAP findings:
- Business process completed/failed/not completed
- Users were impacted by the current outage
- Locations were impacted by the current slowdown
- Unauthorized users accessing the application
- Topology changes
- Load balancing
- Slow databases
- High CPU resource consumption on specific tiers
- Web services provider underperforming
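As a rough illustration of what "slice and dice" means, the sketch below rolls the same transaction records up along two different dimensions. The record fields and values are invented for the example, and a real OLAP engine does far more than this; the idea is only that one data set answers different questions depending on the dimension you pick.

```python
from collections import defaultdict

# Hypothetical per-transaction records captured by the monitoring solution.
records = [
    {"tier": "web", "location": "NYC", "response_ms": 120, "failed": False},
    {"tier": "db", "location": "NYC", "response_ms": 900, "failed": True},
    {"tier": "web", "location": "London", "response_ms": 110, "failed": False},
    {"tier": "db", "location": "London", "response_ms": 300, "failed": False},
]


def roll_up(rows, dimension):
    """Aggregate count, failures and average response time per dimension value."""
    totals = defaultdict(lambda: {"count": 0, "failed": 0, "total_ms": 0})
    for row in rows:
        cell = totals[row[dimension]]
        cell["count"] += 1
        cell["failed"] += int(row["failed"])
        cell["total_ms"] += row["response_ms"]
    return {
        key: {
            "count": cell["count"],
            "failed": cell["failed"],
            "avg_ms": cell["total_ms"] / cell["count"],
        }
        for key, cell in totals.items()
    }


by_tier = roll_up(records, "tier")          # which tier is slow?
by_location = roll_up(records, "location")  # which locations are impacted?
print(by_tier)
print(by_location)
```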
Tying it all Together
Armed with this knowledge, you can now go solve the problem. You can even assign business impact to the real-time analysis to prioritize your actions. For example, through real-time OLAP you can discover that two transactions are beginning to fail. One of the transactions is responsible for $1M a minute in revenue, and the other is worth $10k. It’s now easy to figure out what to solve first.
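The prioritization above can be sketched in a few lines. The revenue figures follow the example in the text (treating the $10k as a per-minute figure too, which is an assumption), while the transaction names and failure percentages are hypothetical.

```python
def revenue_at_risk_per_min(transaction):
    """Revenue per minute scaled by the share of requests currently failing."""
    return transaction["revenue_per_min"] * transaction["failure_pct"] // 100


def prioritize(transactions):
    """Order failing transactions by revenue at risk, highest first."""
    return sorted(transactions, key=revenue_at_risk_per_min, reverse=True)


failing = [
    {"name": "low_value_tx", "revenue_per_min": 10_000, "failure_pct": 50},
    {"name": "high_value_tx", "revenue_per_min": 1_000_000, "failure_pct": 5},
]

for tx in prioritize(failing):
    print(tx["name"], revenue_at_risk_per_min(tx))
```

Note that the high-value transaction comes first even though a much smaller share of it is failing, which is the kind of call that is hard to make from technical metrics alone.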
This type of analysis is crucial to operating an APM solution in a production environment effectively. With so much data available to analysts, it is imperative that not only can they make sense of it, but that they have access to the information as problems arise. Through a combination of alerts, dashboards and OLAP engines users can effectively monitor their infrastructure proactively.
Next week we’ll discuss broad platform support in part four of the series.
Five Keys to Success with APM in Production Environments – APM Analytics (Part 2 of 5)
By Diego Lomanto (Twitter: diego_lomanto)
This is the second of a five part series where we explore the critical factors of implementing APM in production environments successfully. You can find part one here. Please check back next week for part three.
In this series we are discussing how the Gartner Magic Quadrant provides a great start to implementing an APM solution. However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation. Capabilities that don’t get as much coverage in the media. They are:
- Continuous monitoring, NOT exception-based monitoring
- APM analytics that enable you to become more proactive with application/transaction data
- Real-time monitoring for proactive APM analysis
- Broad platform support eliminating all blind spots in your monitoring strategy
- Enterprise readiness for growth and scalability
Over these five blog entries I’ll spend a little bit of time on each of these success factors so you can be sure that you purchase a solution that will deliver the results you expect, not just in development and testing environments but also in production.
Part 2 – APM Analytics
Last week we talked about the virtues of a continuous monitoring strategy at length. But now that we can see everything, we’re going to have to find a way to make sense of it. A major risk for APM solutions in production environments is that they simply overwhelm the end-user with data, or the opposite occurs. They don’t provide enough actionable intelligence. It’s just hard to manually determine what’s important.
This is where APM analytics comes into play. Analytics should not be an optional component of APM – it is vital to fulfill the promise of APM. It enables you to analyze application performance in ways previously impossible or requiring massive amounts of work. And, analytics makes APM accessible to the enterprise.
Types of Analysis
The easiest way to understand APM analytics is to look at the use cases. The common use cases of analytics for APM are real-time, short-term analysis and long term planning.
Real-time (e.g. A product server is about to violate an SLA)
- Real-time OLAP
- Alerting to isolate problems while they are happening in transactions, infrastructure and business process.
In the Short-term (Why were transactions 10% slower today?)
- Business event correlation for root cause analysis
- Capacity management
In the Long-term (What applications can we move to the cloud?)
- Improving user and application behavior
- Capacity planning
- Cloud architecture planning
Here are some of the common types of reports you will get:
Real-time analytics is a topic that deserves its own post, so I’ll cover that next week in detail. In this post we’ll focus on the short and long term use cases.
What Makes APM Analytics Work in Production?
Ok, sounds good so far, right? There’s a gotcha. You knew there would be! In order to provide the right amount of actionable intelligence in a production environment, you must first start with good data. The concept of “garbage-in, garbage-out” holds very much true for APM analytics. Here’s the secret to good APM data: entity relationships.
Entity relationships hold information about the interaction of a transaction with other components of the infrastructure. (E.g. this transaction was in this tier for this long before moving to that tier). Entity relationships are crucial to APM analytics because they allow you to infer root cause. Most APM solutions cannot provide detailed entity relationships in production because they do not track all the tiers and they do not track all transactions. This all goes back to the continuous monitoring requirement from last week. You might be starting to see that the keys to success with APM in production are related to each other.
Ok, Sounds Good. How About Some Examples?
Sure thing. At OpTier, we call the customizable part of our APM analytics Business Events and we’ve helped customers use it to detect the following:
- Poorly designed SQLs as the root-cause of slow transactions
- ESB wrongly orchestrating transactions
- Retail banking payment transactions traversing certain application components before final booking
- Trading transactions having specific cut-off times in the day
- Order fallouts (common for telcos)
- Resource-intensive batch tasks impacting online transaction activities
- Specific users impacting system performance
Let’s take a deeper dive. Here’s an example of APM analytics uncovering the root cause of where transactions are failing in the short term. In the screenshot below we are analyzing transaction flows and can see that there is a missing step in the overall process flow: “Send Invoice.”
The APM solution can detect and report “Send Invoice” as a root cause because of the entity relationships. There are relationships between tiers in a transaction flow, and when the system can understand that it can start to detect when those relationships start to change. The next step here is for an analyst to look at the invoicing system and determine why that step in the process is not occurring. This improves mean time to resolution, as the analyst is not forced to look at every tier, just the problematic ones. He or she can then get the issue over to developers to fix faster than they could have before APM analytics was available in production.
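Here is a small sketch of how a missing step could be spotted once the expected relationships between steps are known. Only “Send Invoice” comes from the example above; the other step names, the transaction IDs and the comparison logic are assumptions for illustration, not a description of how any product implements it.

```python
from collections import Counter

# Hypothetical expected process flow; only "Send Invoice" is from the example.
EXPECTED_FLOW = [
    "Receive Order",
    "Check Inventory",
    "Charge Payment",
    "Send Invoice",
    "Ship Order",
]


def missing_steps(expected, observed):
    """Expected steps that never appeared in an observed flow, in flow order."""
    seen = set(observed)
    return [step for step in expected if step not in seen]


def missing_step_counts(expected, observed_flows):
    """Count how often each expected step is missing across transactions."""
    counts = Counter()
    for observed in observed_flows.values():
        counts.update(missing_steps(expected, observed))
    return counts


observed_flows = {
    "tx-1": ["Receive Order", "Check Inventory", "Charge Payment",
             "Send Invoice", "Ship Order"],
    "tx-2": ["Receive Order", "Check Inventory", "Charge Payment", "Ship Order"],
    "tx-3": ["Receive Order", "Check Inventory", "Charge Payment", "Ship Order"],
}

counts = missing_step_counts(EXPECTED_FLOW, observed_flows)
print(counts.most_common(1))
```

A step that is suddenly missing in many flows points the analyst at one system to investigate instead of every tier.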
That is just one example of the power of APM analytics in a production environment. Because of the depth of information, APM analysts need a way to parse through the volume to get to the root causes. Analytics is the key to delivering this success in production.
What do you think? Have you come across any other good examples of analytics? I’d love to hear some of your stories.
Stay tuned for the next installment of this series, where we will discuss leveraging real-time analysis to proactively monitor applications. If you’d like to be notified when the post goes up, please follow me on Twitter @diego_lomanto.
Five Keys to Success with APM in Production Environments – Continuous Monitoring (Part 1 of 5)
By Diego Lomanto (Twitter: diego_lomanto)
This is the first of a five part series where we explore the critical factors of implementing APM in production environments successfully. Please check back next week for part two.
If you are currently evaluating an Application Performance Management (APM) solution you probably realize by now there are several capabilities that must be included in order to maximize the value of APM. Gartner summed these up nicely in their recent magic quadrant report. Dynamically generated topology maps, application diagnostics, transaction monitoring, end user experience, and reporting capabilities have become the table stakes for APM these days. I talked a bit about using these dimensions to take a business transaction-driven approach to APM in my last post.
These dimensions are the baseline requirements when considering an APM solution. However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation. Capabilities that don’t get as much coverage in the media. They are:
- Continuous monitoring, NOT exception-based monitoring
- APM analytics that enable you to become more proactive with application/transaction data
- Real-time monitoring for proactive APM analysis
- Broad platform support eliminating all blind spots in your monitoring strategy
- Enterprise readiness for growth and scalability
Over the next five blog entries I’ll spend a little bit of time on each of these success factors so you can be sure that you purchase and deploy a solution that will deliver the results you expect not just in development and testing environments but also in production. Let’s start with continuous monitoring:
Part 1 - Continuous Monitoring, NOT Exception-Based Monitoring
The first entry in this series deals with the value of enabling a continuous monitoring solution rather than an exception-based one. Many APM solutions have trouble dealing with high-volume environments so they function in a passive mode, tracking mostly high-level metrics and basic KPIs, waiting for a pre-defined exception to occur. Only then is a more active monitoring mode entered. Tier metrics are not a reflection of transaction health and have little to do with the end-user experience.
On the other hand, continuous monitoring solutions were built from the ground up with low overhead so that they can run 24×7 on all transactions. We recommend a continuous approach in your production environment. Here’s the rationale:
The Risk in Production with Exception-Based Solutions
There are a few problems with exception-based solutions:
- Does not surface problems you haven’t defined as a breach in advance. This is the main problem with exception-based solutions. If the administrators of the system have accurately planned for all of the breaches that might occur, then they might be able to get data on problems within the environment. But what if the breaches are not well-defined? You end up with blind spots. Everything looks fine because no red flags are getting reported. But is that the reality? How do you know if you can’t see everything?
- Frequent smaller problems fall between the cracks because they occur sporadically and not consistently enough for the tool to decide that it is an “exception”. However, all of these small problems often add up to poor end-user experience. And even if such breaches do trigger the exception mechanism, what happens if it does not occur again while the exception based tool is watching? Nothing gets reported.
- Monitoring uncovers no problems because the issue occurred already and the system has returned to normal state. And as soon as it goes back to passive mode the problems arise again, triggering the exception but no meaningful data. You end up going around in circles and never truly resolving the problems.
What’s happening here is that exception-based solutions leave you with too many blind spots to manage application performance effectively.
Exception-based tools work this way in production to minimize their overhead and the amount of data that they capture. These tools were designed for helping developers debug their code, not for 24/7 production use, so they are not able to monitor and analyze millions of unique activities every day. They have to apply some sort of a selection mechanism to decide what to monitor and what can be ignored.
How Does Continuous Monitoring Help?
To deal with all future problems you need to be able to see everything. You need to know what happened before the problem occurred and understand what’s happening right now. You need to know what is considered normal. Otherwise, how do you know what is abnormal? Sometimes the problem is simply not definable in advance and flies under the radar of exception-based solutions. For example, if an important database table gets deleted by accident, application performance might actually look to be improving. Exception-based solutions might not notice anything was wrong even though from the end users’ perspective all the data is gone. This is a full-blown application outage.
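One simple way to express "know what is normal" in code is a statistical baseline with a two-sided check, so that a response time that is suspiciously fast gets flagged just like one that is too slow. This is a minimal sketch using the mean and standard deviation; the sample values and the threshold of three standard deviations are assumptions, and real solutions use more sophisticated behavior learning.

```python
import statistics


def build_baseline(samples):
    """Summarize historical response times as (mean, standard deviation)."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_abnormal(value, baseline, threshold=3.0):
    """Flag values far from the mean in either direction."""
    mean, stdev = baseline
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical response times (ms) collected during normal operation.
history = [200, 210, 190, 205, 195, 200]
baseline = build_baseline(history)

print(is_abnormal(204, baseline))  # close to the usual range
print(is_abnormal(20, baseline))   # far faster than usual, e.g. an empty table
```

A threshold that only fires on slow responses would treat the deleted-table case as good news, which is why the check looks at deviation in both directions.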
Here’s what an effective continuous monitoring solution will do for you:
- Discover, classify and track all business transactions across multiple tiers and components.
- Identify the exact performance details at each step that the application executes in order to quickly isolate problems.
- Alert IT staff to developing service disruptions and anomalies long before they are detected by end users.
- Enable IT to proactively manage application performance and prevent service level degradation or interruptions to business services.
- Monitor transactions that had not been defined up-front as “transactions of interest”.
The diagram below depicts a dynamically generated topology map from a continuous monitoring solution that has automatically, and without any input from systems administrators, detected the true architecture of the application environment – including tiers that may be unexpectedly part of the transaction flow.
That’s a powerful capability that you can’t get with exception-based technology. Another example of where exception-based monitoring would fail is the common situation of a batch job or some other nightly activity that accidentally got kicked off in the middle of the business day. Such nightly processes often hammer the databases as they perform complex calculations and produce detailed reports. When running in the middle of the day, they will slow down other transactions that are also trying to access the databases.
What would an exception-based solution do? At best, it will show that online transactions are slowing down, CPU and activity levels are high, and some systems may be running close to capacity, but it will not point to the offending batch job as the root-cause because batch jobs are not among the business activities that had been defined upfront for monitoring. The Operations manager might conclude that it is time to upgrade the hardware (because it is getting close to capacity in the middle of the day) without realizing that the hardware is just fine and the real issue has to do with a job scheduling error.
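The difference between the two views can be sketched as follows. The activity names, the database timings and the set of "predefined" activities are all hypothetical; the point is only that the offending job is invisible when the data set is filtered to what was defined up front.

```python
from collections import defaultdict

# Activities an exception-based tool was configured to watch (assumed).
PREDEFINED = {"checkout", "search"}


def db_time_by_activity(records, only=None):
    """Total database time per activity, optionally limited to a known set."""
    totals = defaultdict(int)
    for record in records:
        if only is not None and record["activity"] not in only:
            continue
        totals[record["activity"]] += record["db_ms"]
    return dict(totals)


def top_consumer(totals):
    """Activity with the largest total database time."""
    return max(totals, key=totals.get)


# Hypothetical midday sample, including a batch job started by mistake.
records = [
    {"activity": "checkout", "db_ms": 40},
    {"activity": "search", "db_ms": 30},
    {"activity": "nightly_report_batch", "db_ms": 5000},
    {"activity": "checkout", "db_ms": 60},
]

continuous_view = db_time_by_activity(records)
exception_view = db_time_by_activity(records, only=PREDEFINED)
print("continuous:", top_consumer(continuous_view))
print("exception-based:", top_consumer(exception_view))
```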
Those are just a few examples of the power of continuous monitoring in a production environment. For more you can visit the OpTier site. What about you? Have you come across any other good examples of a continuous monitoring solution detecting problems that would have been missed by an exception-based methodology? I’d love to hear some of your stories.
I’ll be back next week to discuss leveraging APM analytics to uncover root cause for the second part of this series. If you’d like to be notified when the post goes up, subscribe to our feed, click on the Twitter button at the top of the page, or follow me at @diego_lomanto.
What is Business Transaction-Driven Application Performance Management?
By Diego Lomanto (Twitter: diego_lomanto)
If you are in IT operations, or manage business applications, you are probably starting to hear the term “Business Transaction-Driven Application Performance Management” more often. At OpTier, business transaction management is our core approach to APM so I thought I’d put together a post about what this term means for those looking around the web for more information. Let’s start with a definition:
Business transaction management is an approach to application performance management (APM) that puts the transaction as the foundation for all other dimensions of the APM model.
What does this mean and why would you do this? By taking a business transaction-driven approach to APM, you can uncover the dynamic application performance variations that occur due to the ever increasing distributed nature of tiers in today’s modern IT infrastructure. Your web server is hosted here, but your database lives there….oh wait there’s some interaction with a mainframe that is housing code written in the 70s. Applications no longer operate in self-contained environments – and they haven’t for a long time. But the increased adoption of technologies such as the cloud over the last few years has accelerated the complexity. You need to find a method of monitoring that can traverse all of the tiers. And that’s where the transaction comes into play.
Business transactions are both the services our users consume from our IT applications and the singular activity that crosses all tiers to provide that service. And if we could find a way to have that transaction update us on its health and performance as it does its work from tier-to-tier, then we can get the most accurate picture of application performance. That’s exactly what business transaction-driven APM does.
To understand what value that brings to managing enterprise applications, let’s look at the dimensions of APM and how each dimension can be improved by using business transactions as the foundational component. I’ll use Gartner’s recently published 2011 Magic Quadrant for APM as the source and definition for each dimension, which Gartner described as “Five distinct dimensions of, or perspectives on, end-to-end application performance have been assembled by market participants, each one essential and complementary to all the others.”
End-user experience monitoring
Gartner definition: “The capture of data about how end-to-end application availability, latency, execution correctness and quality appeared to the end user”
Additional Value of a Transactional Foundation: The transaction begins here. By measuring application performance from the end user’s perspective, 24/7 and 100% of the time, change-impact analysis shows managers how a certain change at a given time has impacted the user experience providing a rich end-to-end analysis.
Runtime application architecture discovery, modeling and display
Gartner definition: “The discovery of the software and hardware components involved in application execution and the array of possible paths across which these components could communicate to enable that involvement”
Additional Value of a Transactional Foundation: With a transaction foundation, topology maps are derived from the true transaction path through distributed tiers. It is impossible to generate accurate application architecture discovery without a transaction-driven approach. With such an approach not only do we achieve a living topology view of dependencies but we achieve it without the need to model!
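To illustrate how a topology map can fall out of transaction paths without any up-front modeling, here is a minimal sketch that turns observed tier-to-tier paths into a set of weighted edges. The tier names and paths are made up, and a real product would capture far more detail per edge.

```python
from collections import Counter


def build_topology(paths):
    """Count each tier-to-tier hop observed across all transaction paths."""
    edges = Counter()
    for path in paths:
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return edges


# Hypothetical paths taken by three observed transactions.
paths = [
    ["web", "app", "db"],
    ["web", "app", "mainframe"],
    ["web", "app", "db"],
]

topology = build_topology(paths)
for (src, dst), calls in sorted(topology.items()):
    print(f"{src} -> {dst}: {calls} call(s)")
```

Because the edges come from what transactions actually did, a dependency nobody documented (the mainframe hop here) shows up on its own.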
User-defined transaction profiling
Gartner definition: “The tracing of events as they occur among the components or objects as they move across the paths discovered in the second dimension; this is generated in response to a user’s attempt to cause the application to execute what the user regards as a logical unit of work”
Additional Value of a Transactional Foundation: The foundational concept that enables the transaction foundation. By tracing every transaction starting at the end user (see experience monitoring above) a seamless view of the transaction is achieved from users, across datacenters and into clouds. In addition to providing topology maps, a business transaction approach also measures the performance and resource footprint at each tier that the transaction passes through to give you more command and faster resolution of problems.
Component deep-dive monitoring in application context
Gartner definition: “The fine-grained monitoring of resources consumed by and events occurring within the components discovered in the second dimension”
Additional Value of a Transactional Foundation: A business transaction-driven approach helps IT determine which application components actually need deep-dive assistance. Without it, APM tools require the user to tell them what to look for and where to look for it. This can be extremely difficult with such complexity in the application environment and with so many different people involved in managing applications and infrastructure. Moreover, when the application code changes over time (which in today’s agile environment happens very frequently), the configuration of deep dive tools needs to be updated.
Gartner definition: “The marshaling of techniques, including behavior learning engines, complex-event processing (CEP) platforms, log analysis, and multidimensional database analysis to discover meaningful and actionable patterns in the typically large datasets generated by the first four dimensions of APM”
Additional Value of a Transactional Foundation: Once again, using a transactional foundation delivers real-time, cross-tier visibility into the relationships between user actions, application behaviors and infrastructure behavior, even when complex business transactions are involved, including multi-segment transactions that flow through multiple platforms and locations. As opposed to combining siloed data, this transactional approach provides analytics that are far more effective, intuitive and efficient to use to achieve proactive control over application performance.
Source: Magic Quadrant for Application Performance Monitoring, September 2011, Will Cappelli, Jonah Kowall
This business transaction-driven approach to APM is what we do at OpTier and we believe that it is changing the way IT manages applications. Hope this helps you understand the term a little bit better!






