Author Archive

Overcoming the Complexity of Big Data with Big Transaction Data

By Diego Lomanto

For most companies, the challenge with big data lies in making sense of the data acquired in order to apply it to real world problems when decisions matter most. Big data is hot right now because we recognize that we are generating more data than ever before and that we might be able to do something with it. However, much of the execution of big data has been around storage of the data (think Hadoop) and search (think Splunk). That’s a great start, but do they really solve any problems in a new way on their own?

Start a big data project and you will soon realize that the data itself is limited because it is partial (takes whatever is available), difficult to consume for analysis (because it’s unstructured) and often offers limited value use cases.  It’s complicated.

I think the evolution towards better value from the data is still in progress.  I think we’ll not only see continued progress in storage but I believe that technology will emerge to make working with big data feel a wee bit smaller.  What I mean by that is we’ll still collect the data at massive scales, but there will be technology that simplifies the big data into a model that is consumable by analytic applications.  In other words, it will transform the data to actually represent something that can be analyzed.

Big Transaction Data

Big Transaction Data (BTD) is a great example of this.   It is complete, comprehensive and correlated.  But it’s also usable.  Let’s have a quick primer on BTD.

What it is, effectively, is the data generated by transactional systems in raw form modeled to represent the unique end-to-end transaction that drove the data generation in the first place, and stored alongside millions, billions, trillions (insert your own “illion” here) of other transactions.  This is done by technology – typically business transaction management software that observes and reports on transaction performance at each tier.

This is REAL big data in action. And that’s where big transaction data comes into play. BTD takes the data and stores it in a consumable form for analytics. The transaction becomes the anchor for the analytics process.
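To make the modeling step concrete, here is a minimal sketch in Python of how raw per-tier records might be rolled up into end-to-end transactions. The field names (txn_id, tier, start_ms, duration_ms) and the function name are hypothetical, chosen for illustration, and this is not OpTier's actual data model. It also assumes every record already carries a shared transaction ID, which is the hard part that the tracking technology does for you.

```python
from collections import defaultdict

# Hypothetical raw records, one per tier visit. Field names are assumptions.
raw_events = [
    {"txn_id": "T1", "tier": "web", "start_ms": 0, "duration_ms": 12},
    {"txn_id": "T2", "tier": "web", "start_ms": 5, "duration_ms": 9},
    {"txn_id": "T1", "tier": "app", "start_ms": 12, "duration_ms": 30},
    {"txn_id": "T1", "tier": "db", "start_ms": 42, "duration_ms": 8},
    {"txn_id": "T2", "tier": "app", "start_ms": 14, "duration_ms": 25},
]


def build_transactions(events):
    """Group per-tier records by transaction ID and order them by start time."""
    by_txn = defaultdict(list)
    for event in events:
        by_txn[event["txn_id"]].append(event)

    transactions = {}
    for txn_id, hops in by_txn.items():
        hops.sort(key=lambda e: e["start_ms"])
        transactions[txn_id] = {
            "path": [h["tier"] for h in hops],
            "total_ms": sum(h["duration_ms"] for h in hops),
        }
    return transactions


transactions = build_transactions(raw_events)
```

Five raw data points become two transactions, each with a path and a total time. That is the unit an analytics tool would then work with.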

The Problem with Fragmented Data

For example, say you wanted to analyze the end to end process performance of a financial trade system. The systems that execute financial trades are ridiculously complex. Think of the most complex system you can think of and then multiply it by 3. Why? Because they use a mix of new and old technologies, are distributed across multiple tiers and are managed by many different stakeholders. So what you get is this hodgepodge of tiers to execute trades that is incredibly difficult to rationalize into a singular data set. The unfortunate by-product of this is that your view of the trade transaction is really just fragmented data. You can see pieces of the transaction performance but not really ALL of the transaction.

But, you still need to analyze trades across the tiers and processes as a single input into your trade effectiveness analysis.  So you do the best you can.  You go deep into the tier data and try to correlate it on your own within your own analytic model.  For example, you try to monitor cross-process fallout with a cool looking dashboard that gives you data on each process, but you don’t really do it well and miss a lot of cross-process issues.

Or you try to do a cost analysis.  Or a segmentation analysis.  Or a performance analysis.  But the work to create a singular data set is so complicated that you never really have full confidence in the results.

Big Transaction Data in Action

Here is a great opportunity to employ big transaction data. Instead of working with billions of manually correlated data points, let’s simplify and work with millions of well-defined transactions instead. End-to-end transactions that represent each trade across each process in full. Now you have a data set that you can inject into your BI platform, or you can simply use the BI tools within the big transaction data solution itself for analysis.

So back to those 3 Cs. The data is complete – that means all information is generated by BTM in one end-to-end view. It’s comprehensive – capturing ALL interactions. And, it’s correlated – it knows everything about vital meta-data such as user, tiers, etc. The result is easy-to-consume, meaningful analytics leading to business outcomes.

So, big data is hot.  But it’s not quite there yet.  We’re waking up with more data but we’re still working to rationalize it.  Fortunately, the technology is on its way to simplify and gain more (true) value from big data.

May 9, 2012 at 9:11 am

The Key to Next Generation APM: Dynamic Topology Maps in Action

By Diego Lomanto and Iddo Avneri 

I recently heard a customer that manages applications for a Fortune 50 bank say in a meeting, “Before OpTier we needed almost 20 people to isolate the problem location, now I can do it in 5 minutes by myself”.  He attributed the breakthrough to our dynamic topology mapping functionality that is a core capability of Always-on APM.   I have heard comments like this about dynamic topologies quite often, so I thought it would be neat to document how and why this happens.  I went to our Director of Presales, Iddo Avneri – the expert on how our product works in the field and he provided some amazing examples of dynamic topologies in action (thanks Iddo!).  That research led to this blog post.

Before I get into what I found, let me quickly summarize what dynamic topology maps (or “living topologies” or “business transaction maps” as some people like to call them) are and how they are used in a BTM-driven APM solution. Since OpTier’s Always-on APM tracks transactions in real time, across distributed application tiers, it discovers the flow topology for each transaction as a byproduct of the tracking and can display it visually. This way, it is possible to easily see valuable information such as the different paths transactions of the same type take, whether the transaction followed its expected path and did not deviate to unexpected databases, applications, or legacy servers, where transaction flow breaks and so on.

Let’s imagine you were to draw a diagram of your application architecture at any point in time.  It could look something like this:

One key problem with the drawing above is that it’s static.  That’s a moment in time.  There are tools out there which can do this for you, but because they are not using transactions as the method for tracing the topology, they do not update in real-time or give you good information around the relationship between the tiers.

That’s where a dynamic topology map comes into play. The image below shows an example of the flow topology of one type of a business transaction as discovered and visualized by a transaction-driven approach to APM.

As you can see, this mapping looks a lot like the hand-drawn version. However, it’s dynamic – it has been automatically derived from the actual transactions that flowed through the app in this period. This means that the next time you look at it, if the infrastructure or app evolves and changes – this picture will evolve and change as well, and you don’t need to do anything. Or it could simply be that transaction types not used during the first period will be invoked. The system automatically detects that.

Another great advantage of this approach is that it provides details on the health and performance of tiers from a transaction perspective – i.e. how did the tier impact the transaction tracked as it went through, and also provides insight into the relationship between tiers in the transaction context – i.e. which transactions use the LDAP connection between tier A and the SiteMinder server.  In the picture above you can see some tiers are green and some are red – these colors are indications of service level performance.  And the lines between tiers show how much transaction volume flows between the tiers. So how are these maps being used today?

The reason that we can do this is because of the transaction data we generate.  This data has many uses, but one of the keys is how simple it makes application performance management.  As the transaction passes each tier, data is sent to a centralized server for analysis.  The end result is this map.  You can only do this if you take a transaction approach.
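Here is a rough sketch in Python of how a map like this could fall out of transaction data. It assumes each tracked transaction is available as an ordered list of tier names, which is a simplification made for illustration. The tier names and the build_topology function are hypothetical, not part of any product API.

```python
from collections import Counter

# Hypothetical tier paths observed for individual transactions.
observed_paths = [
    ["web", "app", "db"],
    ["web", "app", "ldap"],
    ["web", "app", "db"],
]


def build_topology(paths):
    """Count how many times each tier-to-tier edge was traversed."""
    edges = Counter()
    for path in paths:
        # Pair each tier with the tier that follows it in the path.
        for source, target in zip(path, path[1:]):
            edges[(source, target)] += 1
    return edges


topology = build_topology(observed_paths)
```

Nobody told the code that "app" talks to "ldap". The edge exists because a transaction went that way, and the counts give you the relative volume on each line of the map.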

Users of Dynamic Topologies

Let’s start with who is using it. It starts with the folks in the service operations center. A dynamic topology gives them a view of the overall health of the infrastructure and how it relates to the application services being delivered by it. And if SLAs are in danger of being breached, and alerts start to go off indicating the degradation, they know exactly where in the topology to go look deeper and who to involve. In addition, performance optimization analysts and applications support use dynamic topologies to find ways to improve performance. QA teams can compare different load tests and see if any tier is performing badly, or even check whether all the tiers and the entire transaction path are being tested in a load test. And perhaps one of the most interesting uses of dynamic topologies is by the application support team to eliminate all-hands calls. To me, that was one of the neatest benefits that I saw. Instead of getting 30 people on one call and going through a painful roll call to diagnose the problem (“no problem here!”) you can just go directly to the managers of the tiers that are in the red.

Deployment of Dynamic Topologies

The key to deploying this technology is that dynamic topologies represent automatically discovered relationships. All you need to do is instrument the tiers (a simple agent install on the key tiers – not all of them either) but you don’t need to tell the technology anything about the relationship between tiers. That’s the magic. The agents automatically detect those relationships by tracking the transactions that flow across the tiers end to end – the way Fedex tracks packages. Here for example is a screenshot of a UAT environment in which an application makes calls to a production database as discovered by transactions making those calls.

Reading Dynamic Topologies

Dynamic topologies provide a few crucial pieces of information just by looking at them.

  • Coupling between tiers – the thickness of the line between tiers represents how many transactions go through relative to other flows
  • Chattiness between tiers – indicated by color – with red indicating chatty (as in potentially bad) behavior

In the screenshot below you will see that customer portals 1 and 2 are very chatty with the authentication server. OpTier’s Always-on APM identified single executions on the application servers that made hundreds of authentication calls because of a misconfigured Single-Sign-On mechanism. As a result, users suffered very bad response times for specific types of transactions.
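As a sketch of how chattiness could be measured, the Python below counts calls per transaction on each tier-to-tier edge and flags anything over a cutoff. The record layout, the tier names and the threshold of 50 are all assumptions made for illustration; a real product would pick its own definition of "chatty".

```python
from collections import Counter

# Hypothetical call records: (transaction ID, calling tier, called tier).
calls = [("T1", "portal1", "auth")] * 200 + [
    ("T2", "portal1", "auth"),
    ("T2", "portal1", "db"),
    ("T3", "portal2", "auth"),
]

CHATTY_THRESHOLD = 50  # assumed cutoff for calls per transaction on one edge


def chatty_edges(call_records, threshold=CHATTY_THRESHOLD):
    """Return (txn, source, target) -> call count for edges above the threshold."""
    counts = Counter(call_records)
    return {key: n for key, n in counts.items() if n > threshold}


flagged = chatty_edges(calls)
```

Here a single execution (T1) makes 200 authentication calls and is the only one flagged, which mirrors the misconfigured single sign-on case described above.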

Another common example is transactions that make many connections to the database. In the screenshot below, we display a single invocation of a transaction (an instance). Every arrow between the portal application server and the reporting server represents a connection from the shared connection pool. Obviously when a single transaction abuses a shared connection, the rest have to wait for it to become free:

What you will also notice in the screenshot is observed tiers (they are marked by the open door).  You don’t need to instrument (put agents on) your entire environment. Even without installing on all tiers, a dynamic topology will automatically include calls to non-instrumented tiers and show valuable information such as the amount of time spent on those tiers, error information and more.  For specific tiers such as database you can even see more, for example you can show connection pool utilization and long running SQLs.
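One way to picture how an observed tier gets its numbers: the agent on the calling tier times each outbound call, so the uninstrumented target can still be summarized from the caller's side. The Python below is a minimal sketch under that assumption. The record fields and tier names are hypothetical.

```python
from collections import defaultdict

# Hypothetical outbound calls, timed by an agent on the calling tier.
outbound_calls = [
    {"caller": "app", "target": "prod-db", "elapsed_ms": 20, "error": False},
    {"caller": "app", "target": "prod-db", "elapsed_ms": 35, "error": True},
    {"caller": "app", "target": "mainframe", "elapsed_ms": 300, "error": False},
]


def observed_tier_metrics(calls):
    """Summarize call count, time spent and errors per non-instrumented target."""
    metrics = defaultdict(lambda: {"calls": 0, "total_ms": 0, "errors": 0})
    for call in calls:
        entry = metrics[call["target"]]
        entry["calls"] += 1
        entry["total_ms"] += call["elapsed_ms"]
        entry["errors"] += int(call["error"])
    return dict(metrics)


observed = observed_tier_metrics(outbound_calls)
```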

Here is an example where the observed tier capabilities helped detect a major application configuration error. Once Always-on APM was installed in the UAT environment at this customer site, the customer noticed the application server making calls to the production database. While trying to reproduce a problem from production in UAT, someone had copied the application server configuration over from production, including the data source target:

Drilling Down from Dynamic Topologies

What’s neat about dynamic topologies is that they immediately indicate SLA breaches by color and you can isolate the problems for immediate resolution by drilling down using your deep dive tools (already included in the OpTier solution if you don’t have one).  In the screenshot below we see that the compliance web service is breaching its SLA because it is red.  This information tells us:

  • what application is not working
  • the transaction types affected
  • the transaction instances affected (there could be many instances of a specific type of transaction)
  • who is being affected (users , locations…)

Another neat example is usage for load balancing. In the screenshot below we see that there are 3 JBoss instances in a cluster and one of them handles only 12% of the traffic, an uneven distribution. Another common use case is for configuration issues. For example, if you have 4 WebLogic servers that receive an even load, and 3 have an average response time of 5 milliseconds but one of them has a response time of 35 milliseconds, there’s a configuration issue. That information is all right there in the dynamic topology!
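Both checks are simple arithmetic once you have per-instance numbers. The Python below is a sketch using the figures from the two examples. The request counts, instance names and the tolerance and factor cutoffs are assumptions picked for illustration.

```python
from statistics import median

# Hypothetical request counts for a three-node cluster (12% lands on jboss3).
request_counts = {"jboss1": 4400, "jboss2": 4400, "jboss3": 1200}


def traffic_share(counts):
    """Return each instance's share of total traffic as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def underloaded(counts, tolerance=0.5):
    """Flag instances receiving less than tolerance * the even share."""
    even_share = 100.0 / len(counts)
    return [name for name, share in traffic_share(counts).items()
            if share < tolerance * even_share]


def slow_instances(avg_response_ms, factor=3.0):
    """Flag instances whose average response time exceeds factor * the median."""
    typical = median(avg_response_ms.values())
    return [name for name, ms in avg_response_ms.items() if ms > factor * typical]


shares = traffic_share(request_counts)
light = underloaded(request_counts)
slow = slow_instances({"wl1": 5, "wl2": 5, "wl3": 5, "wl4": 35})
```

With an even share of about 33% per node, the instance at 12% is flagged as underloaded, and the server answering in 35 milliseconds stands out against a median of 5.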

So, if you are looking for a solution that can help you get a better sense of the dynamic infrastructure that is your application environment, it’s very helpful to take the dynamic topology approach that is facilitated by transaction analysis. By following transactions through the entire backend architecture, dynamic topologies can give you high level performance overviews, detect hotspots in testing that need to be dealt with before they impact you in production, as well as give you the detailed information that helps you avoid or resolve any problems that arise in production. What about your experiences? We’d love to hear some more stories from your environments.

April 26, 2012 at 1:15 pm

Five Keys to Success with APM in Production Environments – Enterprise Scale and Readiness (Part 5 of 5)

By Diego Lomanto (Twitter: diego_lomanto)

Welcome to the concluding post in our series on the Five Keys to Success with APM in Production Environments.  In this series we have been discussing how the Gartner Magic Quadrant provides a great start to implementing an APM solution.  However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation.   To refresh your memory, so far we’ve covered:

Now, let’s talk about enterprise scale and readiness.

Enterprise Scale and Readiness

If you can check off everything above – you’re in good shape. But there’s one more factor to success with APM in production. When you roll out your investment, will it be able to handle the scale of your business today and when it grows? If you are investing in APM, then you probably have a high volume of critical transactions to analyze. Your solution must be able to handle hundreds of millions of transactions per day. It cannot fail just as transactions sharply rise. These are precisely the times you need APM! It goes back to the whole continuous monitoring vs exception-based monitoring argument and ensuring that the solution does not fail at crunch time, or work around limitations by limiting scale of deployment.

You must also be sure that the solution itself can scale easily as your environment grows. IT is fluid, and APM must be as well. It’s not just about number of transactions either. Determine whether it’s easy to add tracking on new tiers, or whether that is a whole project unto itself. Make sure your vendor provides a path to application monitoring expansion that is manageable.

Speaking of manageability, some other key requirements are that the solution should not have a single point of failure, should be able to be remotely configured and should have high availability. In addition, whatever method of tracking is employed should have low overhead so that it doesn’t crash once you put it into production. This is typically a problem for exception-based solutions so keep your eyes on that.

Now, You are Prepared

Ok, now you have a better handle on the ins and outs of APM in production. It’s important that you pick a product that not only includes the five dimensions of APM, but also can run effectively in production and provide the right information to you at the right time. If you look for these requirements during the research phase, you’ll have a much better experience during roll out. And your solution will provide much more value, both for IT and the enterprise.

Thanks for tuning in during this series.  It’s been great walking through these critical success factors and judging by the traffic it’s helped a few people.  If you have any comments, I would love to hear them in the comment box below, or on twitter @diego_lomanto.


January 17, 2012 at 4:08 pm 1 comment

Five Keys to Success with APM in Production Environments – Broad Platform Support (Part 4 of 5)

By Diego Lomanto (Twitter: diego_lomanto)

This is the fourth of a five part series where we explore the critical factors of implementing APM in production environments successfully. You can find parts one, two and three here. Please check back next week for part five.

In this series we are discussing how the Gartner Magic Quadrant provides a great start to implementing an APM solution. However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation. Capabilities that don’t get as much coverage in the media. They are:

Part 4 – Broad Platform Support to Eliminate Blind Spots in your Monitoring Strategy

Your APM solution must be adaptable to support everything you have within your environment.  This includes support not just for application processing tiers, but also databases and middleware, as they are often the cause of performance problems.  A complete APM solution should support a diverse mix of SOA, private and public cloud, middleware, databases, homegrown, legacy and proprietary technology stacks. A solution that will thrive in production should also support applications of unknown design/no access to code.  This provides IT with visibility into 3rd-party applications’ and components’ performance and reduces the time to identify performance issues associated with 3rd party applications and components.

At the heart of broad platform support is the capability to track each and every transaction instance through its entire life cycle. To enable this, there are multiple techniques to track transactions across virtually any application and environment. Some of these innovative technologies include:

Active Context Tracking - ACT technology uses lightweight agents to track cross-tier context and the resource utilization of each transaction across the entire datacenter.

APIs – C, C++, C#, or Java APIs for monitoring transaction flow through proprietary and homegrown components.

Network packet capture – Non-intrusive agents capture all network traffic to identify and measure transactions from the network perspective.

Real-time log parsing – For platforms that are not optimal for instrumentation, real-time log file analysis is used to identify information about transaction flow. The log analysis is seamlessly integrated into the transaction model with no overhead.

Passive Context Tracking - When a transaction crosses a component that cannot be monitored (for example, a security appliance that cannot be instrumented and does not have logs), passive context tracking is used to stitch together the flow using an ID common to the different parts of the transaction.
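To illustrate the stitching idea behind passive context tracking, here is a small Python sketch. It assumes the segments seen before and after the unmonitored component both carry some common ID (an order number in this made-up example). The field names, the stitch function and the "security-appliance" label are hypothetical, and this is not a description of how any product implements it.

```python
# Hypothetical segments seen on either side of an appliance that cannot be instrumented.
upstream_segments = [
    {"segment": "U1", "common_id": "order-17", "tiers": ["web", "app"]},
    {"segment": "U2", "common_id": "order-18", "tiers": ["web", "app"]},
]
downstream_segments = [
    {"segment": "D1", "common_id": "order-18", "tiers": ["billing", "db"]},
    {"segment": "D2", "common_id": "order-17", "tiers": ["billing"]},
]


def stitch(upstream, downstream, unmonitored="security-appliance"):
    """Join segments sharing a common ID into one end-to-end path."""
    downstream_by_id = {seg["common_id"]: seg for seg in downstream}
    stitched = {}
    for seg in upstream:
        match = downstream_by_id.get(seg["common_id"])
        if match is not None:
            # Insert the unmonitored hop between the two observed halves.
            stitched[seg["common_id"]] = seg["tiers"] + [unmonitored] + match["tiers"]
    return stitched


flows = stitch(upstream_segments, downstream_segments)
```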

When a transaction enters the system, it is identified by one of the agents and then undergoes classification and analytical processing.  An effective production-ready APM solution continues to track the transaction as it traverses web, application, middleware, and database tiers, while collecting performance and resource consumption metrics at each tier. Even when a transaction makes a call to a tier that isn’t monitored, metrics such as the number of calls and the time spent on that tier are captured.

As we discussed in parts one through three, success in production is contingent upon continuously monitoring as much as possible to derive intelligence right when it’s needed. The capabilities listed above, out of the box, are an important part of the always-on discovery and tracking approach to APM. They matter for the speed of deployment and the value provided by the APM solution. If you have wide gaps in coverage, you simply miss too much and have too many blind spots in your monitoring capabilities.

So, that’s the fourth critical requirement.  Coming up next, and closing out the series, is enterprise scalability.


January 5, 2012 at 8:25 pm 3 comments

Five Keys to Success with APM in Production Environments – Real-Time APM (Part 3 of 5)

By Diego Lomanto (Twitter: diego_lomanto)

This is the third of a five part series where we explore the critical factors of implementing APM in production environments successfully.  You can find parts one and two here.  Please check back next week for part four. 

In this series we are discussing how the Gartner Magic Quadrant provides a great start to implementing an APM solution. However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation. Capabilities that don’t get as much coverage in the media. They are:

Part 3 – Real-time Application Performance Monitoring Analytics

In part two of our series, we explored the value of APM analytics to conduct historical analysis of application performance in the short and long-term. Now let’s look at how critical it is to assess application performance in real-time production environments to ensure success.

Real-time APM analytics allows you to make quick decisions about applications to improve performance, while understanding and quantifying the business impact of your actions. A real-time analysis is typically triggered by a real-time alert from the APM solution, or a real-time dashboard within the solution, then followed up by a deep-dive into the data using real-time OLAP tools.

Real-time Alerts

APM Alerts can come via e-mail, text or even through a dedicated smartphone application like the one pictured below.

APM real-time alerts sent to a smartphone

Most APM solutions provide alerts. However, look for one that integrates a “Complex Events Processing” (CEP) engine into the alerting algorithm in production. Without CEP your alerts from production environments are often without significance. CEP engines combine multiple events to generate an alert, giving you a thinking engine that can make correlations between different events and tell you something has gone wrong – even if it’s not apparent at first glance. Most application monitoring tools only generate alerts based on IT events, not business events.

Real-time Dashboards

Dashboards monitor SLAs or other relevant key metrics that assess the health of your systems. Here is an example of an APM real-time dashboard in action:

APM real-time dashboards

Common Real-Time Triggers

Here is an example of how an alert can be constructed in an APM solution with a complex events processing engine. In this scenario, all the transactions are being analyzed and the APM solution triggers an alert when business process SLAs are breached. This is important because transaction SLAs are set at the technical system level, not at the business level. So while the transactions may not have breached their SLAs, the business process is suffering because of poor performance of many transactions.
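The gap between transaction SLAs and a business process SLA is easy to show with numbers. The Python below is a simplified sketch, not a CEP engine. The transaction names, durations and the 2500 ms process target are invented for illustration.

```python
# Hypothetical transactions belonging to one business process instance.
process_transactions = [
    {"name": "validate_order", "duration_ms": 900, "sla_ms": 1000},
    {"name": "check_credit", "duration_ms": 950, "sla_ms": 1000},
    {"name": "book_order", "duration_ms": 980, "sla_ms": 1000},
]

PROCESS_SLA_MS = 2500  # assumed end-to-end target for the business process


def evaluate(transactions, process_sla_ms):
    """Return (transaction-level breaches, whether the process SLA is breached)."""
    txn_breaches = [t["name"] for t in transactions if t["duration_ms"] > t["sla_ms"]]
    total_ms = sum(t["duration_ms"] for t in transactions)
    return txn_breaches, total_ms > process_sla_ms


txn_breaches, process_breached = evaluate(process_transactions, PROCESS_SLA_MS)
```

Every transaction is inside its own limit, so a purely technical alert stays quiet, yet the process as a whole takes 2830 ms and misses its target. An alert built on the combined view would fire here.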

Other common triggers:

  • Database CPU spikes caused by users with older browsers
  • Search responding slowly to one specific query that was not optimized
  • Chatty transactions inhibiting cloud-based transaction performance
  • Imbalanced cluster of middleware tiers

Real-Time OLAP Engines

Once an alert is triggered or a dashboard reveals a problem in real-time, the next step is to slice and dice and look at the data from different points of view to isolate where the problem is coming from in real time. With real-time OLAP engines, you can view the transaction path, tier performance and end-user perspective, just to name a few, in order to isolate issues. This is the true power of real-time analytics in production environments. And, it’s the difference between IT saying all KPIs are green and business saying orders are falling off.

Some common real-time OLAP findings:

  • Business process completed/failed/not completed
  • Users were impacted by the current outage
  • Locations were impacted by the current slowdown
  • Unauthorized users accessing the application
  • Topology changes
  • Load balancing
  • Slow databases
  • High CPU resource consumption on specific tiers
  • Web services provider underperforming
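As a rough picture of what "slice and dice" means here, the Python sketch below aggregates the same set of transaction records along two different dimensions. It is plain in-memory grouping, not a real OLAP engine, and the record fields and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical transaction facts with a few dimensions.
facts = [
    {"location": "NYC", "tier": "db", "response_ms": 120, "failed": False},
    {"location": "NYC", "tier": "db", "response_ms": 480, "failed": True},
    {"location": "LON", "tier": "db", "response_ms": 110, "failed": False},
    {"location": "LON", "tier": "app", "response_ms": 40, "failed": False},
]


def slice_by(rows, dimension):
    """Aggregate count, failures and average response time along one dimension."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row)
    return {
        key: {
            "count": len(members),
            "failures": sum(1 for m in members if m["failed"]),
            "avg_ms": sum(m["response_ms"] for m in members) / len(members),
        }
        for key, members in groups.items()
    }


by_location = slice_by(facts, "location")
by_tier = slice_by(facts, "tier")
```

Viewed by tier, the database looks like the problem. Viewed by location, the failure and the slow response both sit in NYC, which narrows down who is impacted.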

Tying it all Together

Armed with this knowledge, you can now go solve the problem.  You can even assign business impact to the real-time analysis to prioritize your actions.  For example, through real-time OLAP you can discover that two transactions are beginning to fail.  One of the transactions is responsible for $1M a minute in revenue, and the other is worth $10k.  It’s now easy to figure out what to solve first.
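In code, that prioritization is just a sort on revenue at risk. The Python below is a tiny sketch using the two figures from the example. The transaction names are placeholders.

```python
# Hypothetical failing transactions with the revenue figures from the example.
failing = [
    {"name": "transaction_b", "revenue_per_minute": 10_000},
    {"name": "transaction_a", "revenue_per_minute": 1_000_000},
]


def prioritize(transactions):
    """Order failing transactions by revenue at risk, highest first."""
    return sorted(transactions, key=lambda t: t["revenue_per_minute"], reverse=True)


work_queue = [t["name"] for t in prioritize(failing)]
```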

This type of analysis is crucial to operating an APM solution in a production environment effectively.  With so much data available to analysts, it is imperative that not only can they make sense of it, but that they have access to the information as problems arise.  Through a combination of alerts, dashboards and OLAP engines users can effectively monitor their infrastructure proactively.

Next week we’ll discuss broad platform support in part four of the series.

December 27, 2011 at 3:12 pm 3 comments

Five Keys to Success with APM in Production Environments – APM Analytics (Part 2 of 5)

By Diego Lomanto (Twitter: diego_lomanto)

This is the second of a five part series where we explore the critical factors of implementing APM in production environments successfully.  You can find part one here.  Please check back next week for part three. 

In this series we are discussing how the Gartner Magic Quadrant provides a great start to implementing an APM solution. However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation. Capabilities that don’t get as much coverage in the media. They are:

Over these five blog entries I’ll spend a little bit of time on each of these success factors so you can be sure that you purchase a solution that will deliver the results you expect, not just in development and testing environments but also in production.

Part 2 – APM Analytics

Last week we talked about the virtues of a continuous monitoring strategy at length.  But now that we can see everything, we’re going to have to find a way to make sense of it. A major risk for APM solutions in production environments is that they simply overwhelm the end-user with data, or the opposite occurs.  They don’t provide enough actionable intelligence.  It’s just hard to manually determine what’s important.

This is where APM analytics comes into play.  Analytics should not be an optional component of APM – it  is vital to fulfill the promise of APM.  It enables you to analyze application performance in ways previously impossible or requiring massive amounts of work.  And, analytics makes APM accessible to the enterprise.

Types of Analysis

The easiest way to understand APM analytics is to look at the use cases.  The common use cases of analytics for APM are real-time, short-term analysis and long term planning.

Real-time (e.g. A product server is about to violate an SLA)

  • Real-time OLAP
  • Alerting to isolate problems while they are happening in transactions, infrastructure and business process.

In the Short-term (Why were transactions 10% slower today?)

  • Business event correlation for root cause analysis
  • Capacity management

In the Long-term (What applications can we move to the cloud?)

  • Improving user and application behavior
  • Capacity planning
  • Cloud architecture planning

Here are some of the common types of reports you will get:

APM Analytics Output

Real-time analytics is a topic that deserves its own post, so I’ll cover that next week in detail.  In this post we’ll focus on the short and long term use cases.

What Makes APM Analytics Work in Production?

Ok, sounds good so far, right? There’s a gotcha. You knew there would be! In order to provide the right amount of actionable intelligence in a production environment, you must first start with good data. The concept of “garbage-in, garbage-out” holds very much true for APM analytics. Here’s the secret to good APM data: entity relationships.

Entity relationships hold information about the interaction of a transaction with other components of the infrastructure. (E.g. this transaction was in this tier for this long before moving to that tier). Entity relationships are crucial to APM analytics because they allow you to infer root cause. Most APM solutions cannot provide detailed entity relationships in production because they do not track all the tiers and they do not track all transactions. This all goes back to the continuous monitoring requirement from last week. You might be starting to see that the keys to success with APM in production are related to each other.
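One way to picture an entity relationship is as a small record: which transaction, which tier, how long it stayed, and where it went next. The Python below is a sketch of that idea. The TierVisit class and its fields are hypothetical names for illustration, not OpTier's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierVisit:
    """One entity relationship: a transaction's time in a tier and where it went next."""
    txn_id: str
    tier: str
    duration_ms: int
    next_tier: Optional[str]  # None when the transaction ends at this tier


def slowest_tier(visits):
    """Return the tier where the transaction spent the most time."""
    return max(visits, key=lambda v: v.duration_ms).tier


visits = [
    TierVisit("T1", "web", 15, "app"),
    TierVisit("T1", "app", 40, "db"),
    TierVisit("T1", "db", 900, None),
]
suspect = slowest_tier(visits)
```

Because every tier visit is recorded per transaction, pointing at the database as the likely culprit takes one line. If some tiers or transactions were never tracked, the records needed for that inference would simply be missing.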

Ok, Sounds Good.  How About Some Examples?

Sure thing.  At OpTier, we call the customizable part of our APM analytics Business Events and we’ve helped customers use it to detect the following:

  • Poorly designed SQLs as the root-cause of slow transactions
  • ESB wrongly orchestrating transactions
  • Retail banking payment transactions traversing certain application components before final booking
  • Trading transactions having specific cut-off times in the day
  • Order fallouts (common for telcos)
  • Resource-intensive batch tasks impacting online transaction activities
  • Specific users impacting system performance

Let’s take a deeper dive. Here’s an example of APM analytics uncovering the root cause of where transactions are failing in the short term. In the screenshot below we are analyzing transaction flows and can see that there is a missing step in the overall process flow: “Send Invoice.”

APM Analytics Process Analysis

The APM solution can detect and report “Send Invoice” as a root cause because of the entity relationships.  There are relationships between tiers in a transaction flow, and when the system can understand that it can start to detect when those relationships start to change.  The next step here is for an analyst to look at the invoicing system and determine why that step in the process is not occurring.  This improves mean time to resolution, as the analyst is not forced to look at every tier, just the problematic ones.  He or she can then get the issue over to developers to fix faster than they could have before APM analytics was available in production.
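Detecting a missing step can be sketched as a comparison between an expected flow and what was actually observed. In the Python below, only "Send Invoice" comes from the example above. The other step names and the order IDs are made up for illustration.

```python
# Assumed expected process flow; only "Send Invoice" comes from the example.
EXPECTED_STEPS = ["Create Order", "Check Inventory", "Send Invoice", "Ship Order"]

# Hypothetical observed flows, keyed by a made-up order ID.
observed_flows = {
    "order-1": ["Create Order", "Check Inventory", "Send Invoice", "Ship Order"],
    "order-2": ["Create Order", "Check Inventory", "Ship Order"],
}


def missing_steps(expected, observed):
    """Return the expected steps that never appear in an observed flow, in order."""
    seen = set(observed)
    return [step for step in expected if step not in seen]


gaps = {order: missing_steps(EXPECTED_STEPS, steps)
        for order, steps in observed_flows.items()}
```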

That is just one example of the power of APM analytics in a production environment. Because of the depth of information, APM analysts need a way to parse through the volume to get to the root causes. Analytics is the key to delivering this success in production.

What do you think?  Have you come across any other good examples of analytics?  I’d love to hear some of your stories.

Stay tuned for the third installment of this series, where we will discuss leveraging real-time analysis to proactively monitor applications.  If you’d like to be notified when the post goes up, please follow me on Twitter @diego_lomanto.

December 16, 2011 at 3:52 pm 4 comments

Five Keys to Success with APM in Production Environments – Continuous Monitoring (Part 1 of 5)

By Diego Lomanto (Twitter: diego_lomanto)

This is the first of a five-part series where we explore the critical factors of implementing APM in production environments successfully.  Please check back next week for part two.

If you are currently evaluating an Application Performance Management (APM) solution you probably realize by now there are several capabilities that must be included in order to maximize the value of APM.  Gartner summed these up nicely in their recent Magic Quadrant report.  Dynamically generated topology maps, application diagnostics, transaction monitoring, end user experience, and reporting capabilities have become the table stakes for APM these days.  I talked a bit about using these dimensions to take a business transaction-driven approach to APM in my last post.

These dimensions are the baseline requirements when considering an APM solution.  However, maximizing your APM investment in production hinges on critical capabilities that can make or break an implementation, capabilities that don’t get as much coverage in the media. They are:

 

Over the next five blog entries I’ll spend a little bit of time on each of these success factors so you can be sure that you purchase and deploy a solution that will deliver the results you expect not just in development and testing environments but also in production. Let’s start with continuous monitoring:

Part 1 - Continuous Monitoring, NOT Exception-Based Monitoring

The first entry in this series deals with the value of enabling a continuous monitoring solution rather than an exception-based one.  Many APM solutions have trouble dealing with high-volume environments so they function in a passive mode, tracking mostly high-level metrics and basic KPIs, waiting for a pre-defined exception to occur.  Only then is a more active monitoring mode entered.  The trouble is that those tier-level metrics are not a reflection of transaction health and have little to do with the end-user experience.

On the other hand, continuous monitoring solutions were built from the ground up to run 24×7 on all transactions with low overhead.  We recommend a continuous approach in your production environment.  Here’s the rationale:

The Risk in Production with Exception-Based Solutions

There are a few problems with exception-based solutions:

  • They do not surface problems you haven’t defined as a breach in advance.  This is the main problem with an exception-based solution.  If the administrators of the system have accurately planned for all of the breaches that might occur, then you might be able to get data on problems within the environment.  But what if the breaches are not well-defined?  You end up with blind spots.  Everything looks fine because no red flags are getting reported.  But is that the reality?  How do you know if you can’t see everything?
  • Frequent smaller problems fall between the cracks because they occur sporadically and not consistently enough for the tool to decide that they are an “exception”.  However, all of these small problems often add up to poor end-user experience.  And even if such a problem does trigger the exception mechanism, what happens if it does not occur again while the exception-based tool is watching?  Nothing gets reported.
  • Monitoring uncovers no problems because the issue has already occurred and the system has returned to its normal state.  And as soon as the tool goes back to passive mode the problems arise again, triggering the exception but yielding no meaningful data.  You end up going around in circles and never truly resolving the problems.

What’s happening here is that exception-based solutions leave you with too many blind spots to manage application performance effectively.
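
Here is a toy illustration of the second problem in that list.  It is a deliberately simplified model, not a description of any real tool: the “exception-based” function only captures detail for a window of transactions whose average latency breaches a threshold, while the “continuous” function records every transaction and filters afterwards.  The window size, thresholds and latencies are invented.

```python
from statistics import mean


def exception_based(latencies_ms, window=5, avg_threshold_ms=500):
    """Capture detail only for windows whose average latency breaches the threshold."""
    captured = []
    for start in range(0, len(latencies_ms), window):
        chunk = latencies_ms[start:start + window]
        if mean(chunk) > avg_threshold_ms:
            captured.extend(chunk)
    return captured


def continuous(latencies_ms, slow_ms=1000):
    """Record every transaction, then pick out the slow ones afterwards."""
    return [ms for ms in latencies_ms if ms > slow_ms]


# Made-up latencies: two sporadic slow transactions hidden among fast ones.
latencies = [100, 120, 1500, 110, 90, 105, 95, 1400, 100, 115]

print(exception_based(latencies))  # []
print(continuous(latencies))       # [1500, 1400]
```

Both slow transactions were real end-user pain, but neither window's average crossed the line, so the exception-based model never captured anything.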

The problem with exception-based monitoring solutions

Exception-based tools work this way in production to minimize their overhead and the amount of data that they capture. These tools were designed for helping developers debug their code, not for 24/7 production use, so they are not able to monitor and analyze millions of unique activities every day. They have to apply some sort of a selection mechanism to decide what to monitor and what can be ignored.

How Does Continuous Monitoring Help?

To deal with all future problems you need to be able to see everything.  You need to know what happened before the problem occurred and understand what’s happening right now.  You need to know what is considered normal.  Otherwise, how do you know what is abnormal?  Sometimes the problem is simply not definable in advance and flies under the radar of exception-based solutions.  For example, if an important database table gets deleted by accident, application performance might actually look to be improving.  Exception-based solutions might not notice anything was wrong even though from the end users’ perspective all the data is gone.  This is a full-blown application outage.
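
A quick sketch of the “know what is normal” idea, using the deleted-table example.  This assumes you have a baseline of response times from continuous monitoring and flags anything that strays too far from it in either direction.  The numbers, the three-standard-deviation tolerance and the `deviates_from_baseline` helper are all assumptions for illustration.

```python
from statistics import mean, stdev

SLOW_THRESHOLD_MS = 500  # hypothetical pre-defined "exception" threshold


def deviates_from_baseline(baseline_ms, current_ms, tolerance=3.0):
    """Flag a response time that is unusually far from normal, faster or slower."""
    mu = mean(baseline_ms)
    sigma = stdev(baseline_ms)
    return abs(current_ms - mu) > tolerance * sigma


# Made-up history of normal response times for one transaction type.
baseline = [200, 210, 190, 205, 195, 200, 198, 202]

# After the table is deleted, queries come back almost instantly.
current = 20

print(current > SLOW_THRESHOLD_MS)                # False
print(deviates_from_baseline(baseline, current))  # True
```

The one-sided threshold sees a healthy, fast application.  The baseline comparison sees that something is very different from normal, which is exactly what you want someone to look at.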

Here’s what an effective continuous monitoring solution will do for you:

  • Discover, classify and track all business transactions across multiple tiers and components.
  • Identify the exact performance details at each step that the application executes in order to quickly isolate problems.
  • Alert IT staff to developing service disruptions and anomalies long before they are detected by end users.
  • Enable IT to proactively manage application performance and prevent service level degradation or interruptions to business services.
  • Monitor transactions that had not been defined up-front as “transactions of interest”.

The diagram below depicts a dynamically generated topology map from a continuous monitoring solution that has automatically, and without any input from systems administrators, detected the true architecture of the application environment – including tiers that may be unexpectedly part of the transaction flow.

Dynamically Generated Topology in an APM Solution
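
For a feel of how a topology can fall out of the transactions themselves, here is a minimal sketch.  It assumes each traced transaction is simply the ordered list of tiers it passed through, and builds the map from those paths with no up-front model.  The tier names and the `build_topology` helper are hypothetical.

```python
from collections import defaultdict


def build_topology(traces):
    """Derive tier-to-tier edges from the paths that transactions actually took."""
    edges = defaultdict(set)
    for path in traces:
        # Pair each tier with the next tier the transaction moved to.
        for src, dst in zip(path, path[1:]):
            edges[src].add(dst)
    return {src: sorted(dsts) for src, dsts in edges.items()}


# Made-up traces; the mainframe hop is the tier nobody expected.
traces = [
    ["web", "app", "database"],
    ["web", "app", "mainframe"],
    ["web", "app", "database"],
]

print(build_topology(traces))  # {'web': ['app'], 'app': ['database', 'mainframe']}
```

Nobody had to tell the sketch that the mainframe exists.  It shows up because a transaction went there.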

That’s a powerful capability that you can’t get with exception-based technology.  Another example of where exception-based monitoring would fail is the common situation of a batch job or some other nightly activity that accidentally got kicked off in the middle of the business day. Such nightly processes often hammer the databases as they perform complex calculations and produce detailed reports. When running in the middle of the day, they will slow down other transactions that are also trying to access the databases.

What would an exception-based solution do?  At best, it will show that online transactions are slowing down, CPU and activity levels are high, and some systems may be running close to capacity, but it will not point to the offending batch job as the root-cause because batch jobs are not among the business activities that had been defined upfront for monitoring. The Operations manager might conclude that it is time to upgrade the hardware (because it is getting close to capacity in the middle of the day) without realizing that the hardware is just fine and the real issue has to do with a job scheduling error.
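
By contrast, if every activity that touches the database is being tracked, finding the culprit is little more than a sort.  The sketch below assumes each record is an (activity name, database time in ms) pair; the activity names, the numbers and the `db_time_by_activity` helper are invented for illustration.

```python
from collections import Counter


def db_time_by_activity(records):
    """Total database time per activity, heaviest first."""
    totals = Counter()
    for activity, db_ms in records:
        totals[activity] += db_ms
    return totals.most_common()


# Made-up records captured during the business day. "nightly_report" was never
# defined up-front as a transaction of interest, but it is tracked anyway.
records = [
    ("checkout", 120), ("search", 80), ("nightly_report", 9000),
    ("checkout", 150), ("nightly_report", 11000), ("search", 70),
]

ranking = db_time_by_activity(records)
print(ranking[0])  # ('nightly_report', 20000)
```

With that ranking in hand, the conversation is about fixing the job schedule rather than buying hardware.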

Those are just a few examples of the power of continuous monitoring in a production environment. For more you can visit the OpTier site.  What about you?  Have you come across any other good examples of a continuous monitoring solution detecting problems that would have been missed by an exception-based methodology?  I’d love to hear some of your stories.

I’ll be back next week to discuss leveraging APM analytics to uncover root cause for the second part of this series.  If you’d like to be notified when the post goes up, subscribe to our feed, click on the Twitter button at the top of the page, or follow me on Twitter @diego_lomanto.


December 7, 2011 at 12:11 am 5 comments

What is Business Transaction-Driven Application Performance Management?

By Diego Lomanto (Twitter: diego_lomanto)

If you are in IT operations, or manage business applications, you are probably starting to hear the term “Business Transaction-Driven Application Performance Management” more often.  At OpTier, business transaction management is our core approach to APM so I thought I’d put together a post about what this term means for those looking around the web for more information.  Let’s start with a definition:

Business transaction management is an approach to application performance management (APM) that puts the transaction as the foundation for all other dimensions of the APM model.

What does this mean and why would you do this?  By taking a business transaction-driven approach to APM, you can uncover the dynamic application performance variations that occur due to the ever-increasing distributed nature of tiers in today’s modern IT infrastructure.  Your web server is hosted here, but your database lives there... oh wait, there’s some interaction with a mainframe that is housing code written in the 70s.  Applications no longer operate in self-contained environments, and they haven’t for a long time.  But the increased adoption of technologies such as the cloud over the last few years has accelerated the complexity. You need to find a method of monitoring that can traverse all of the tiers.  And that’s where the transaction comes into play.

Business transactions are both the services our users consume from our IT applications and the singular activity that crosses all tiers to provide that service.  And if we could find a way to have that transaction update us on its health and performance as it does its work from tier to tier, then we could get the most accurate picture of application performance.  That’s exactly what business transaction-driven APM does.

To understand what value that brings to managing enterprise applications, let’s look at the dimensions of APM and how each dimension can be improved by using business transactions as the foundational component.  I’ll use Gartner’s recently published 2011 Magic Quadrant for APM as the source and definition for each dimension.  Gartner describes them this way: “Five distinct dimensions of, or perspectives on, end-to-end application performance have been assembled by market participants, each one essential and complementary to all the others.”

End-user experience monitoring

Gartner definition: “The capture of data about how end-to-end application availability, latency, execution correctness and quality appeared to the end user”

Additional Value of a Transactional Foundation: The transaction begins here.  By measuring application performance from the end user’s perspective, 24/7 and 100% of the time, change-impact analysis shows managers how a certain change at a given time has impacted the user experience, providing a rich end-to-end analysis.

Runtime application architecture discovery, modeling and display

Gartner definition: “The discovery of the software and hardware components involved in application execution and the array of possible paths across which these components could communicate to enable that involvement”

Additional Value of a Transactional Foundation: With a transaction foundation, topology maps are derived from the true transaction path through distributed tiers.  It is impossible to generate accurate application architecture discovery without a transaction-driven approach. With such an approach not only do we achieve a living topology view of dependencies but we achieve it without the need to model!

User-defined transaction profiling

Gartner definition: “The tracing of events as they occur among the components or objects as they move across the paths discovered in the second dimension; this is generated in response to a user’s attempt to cause the application to execute what the user regards as a logical unit of work”

Additional Value of a Transactional Foundation: This is the foundational concept that enables the transactional approach. By tracing every transaction starting at the end user (see experience monitoring above), a seamless view of the transaction is achieved from users, across datacenters and into clouds. In addition to providing topology maps, a business transaction approach also measures the performance and resource footprint at each tier that the transaction passes through to give you more command and faster resolution of problems.

Component deep-dive monitoring in application context

Gartner definition: “The fine-grained monitoring of resources consumed by and events occurring within the components discovered in the second dimension”

Additional Value of a Transactional Foundation: A business transaction-driven approach helps IT determine which application components actually need deep-dive assistance.  Without it, APM tools require the user to tell them what to look for and where to look for it. This can be extremely difficult with such complexity in the application environment and with so many different people involved in managing applications and infrastructure. Moreover, when the application code changes over time (which in today’s agile environments happens very frequently), the configuration of deep-dive tools needs to be updated.

Analytics

Gartner definition: “The marshaling of techniques, including behavior learning engines, complex-event processing (CEP) platforms, log analysis, and multidimensional database analysis to discover meaningful and actionable patterns in the typically large datasets generated by the first four dimensions of APM”

Additional Value of a Transactional Foundation: Once again, using a transactional foundation delivers real-time, cross-tier visibility into the relationships between user actions, application behaviors and infrastructure behavior, even when complex business transactions are involved, including multi-segment transactions that flow through multiple platforms and locations. As opposed to combining siloed data, this transactional approach provides analytics that are far more effective, intuitive and efficient to use to achieve proactive control over application performance.

Source: Magic Quadrant for Application Performance Monitoring, September 2011, Will Cappelli, Jonah Kowall

This business transaction-driven approach to APM is what we do at OpTier and we believe that it is changing the way IT manages applications.  Hope this helps you understand the term a little bit better!


October 24, 2011 at 9:37 am 2 comments

