Posts tagged ‘Application Performance’
Online Banking, Still Open for Business!
By Jonathan Williams
A recent incident at a customer site illustrates how OpTier BTM can play a crucial role in detecting, isolating and remediating performance issues before business-critical services are severely affected.
At a large UK bank, OpTier BTM is used to monitor the central internet banking application. With 4 million business customers using the bank’s site, OpTier monitors over 40 million transactions every day. During a recent Friday morning, OpTier BTM detected a marked increase in application response times as well as a large number of errors. It was absolutely critical to address the issue right away, because not only was it the peak time of day, it was also the last Friday of the month – payday for many people – and the last work day before a 3-day bank holiday weekend.
As you can see in the graph above, OpTier BTM showed an increase in average service time (the blue line) and errors (black area) after 9:50 am. Because the timing was so critical, the bank decided to switch over to their remote contingency data center. As the graph shows, performance improved after 10:50, when the switch was made. Even after the switch, we still see some errors because, as a public-facing internet application, it is constantly hit by incorrect URLs – from end user typos to automated Trojans and hack attempts.
While the failover was taking place, the team used OpTier BTM to isolate the cause of the problem. In the graph below, the OpTier dashboard shows a marked increase in service time for User Identification and Verification database calls from the application server. Since nearly every transaction in the application makes a call to this database – even after the user is logged in – nearly all application functionality was affected by the slowdown.
In the drill-down to an individual transaction instance, we can see that calls to the identification and verification database were taking almost two and a half minutes (2:30) to complete.
When we drill down into the topology of another transaction instance, we can see that there is a very large Inter-tier time of 1:41 between Apache and WebSphere, indicating a communication problem. This behavior is usually an indication that the WebSphere resource has been exhausted while waiting for backend availability. This would be a secondary effect of the slowdown of the database service.
With the information provided by OpTier BTM, the bank was quickly able to identify that the source of the problem was in the database, resulting in very fast problem resolution and preventing an all-hands call that would have wasted valuable time for all of the silo teams (not only DBAs but also architects, Java developers, network teams, and representatives from other IT silos). The bank’s DBA quickly pinpointed the source of the problem using OpTier BTM data – one of the nodes in their database cluster had reached its session limit. Without OpTier BTM, even isolating the problem would have been like searching for a needle in a haystack.
Thanks to OpTier BTM, the problem was identified, addressed and resolved as efficiently as possible. Customers were able to deposit their pay and – along with the bank’s support teams – enjoy the holiday weekend.
How Clouds will change Business Transaction Management
by Anonymous, January 2011.
I hate clouds; they generally deliver cold weather and make life dull. I hate them even more because they’ve recently made my job more difficult (and working in product management, it’s not exactly plain sailing at the best of times). I did try my best to avoid Cloud Computing by simply pretending it was all madness. Sadly, this naive approach didn’t work and here I am writing a blog on the subject.
For anyone who’s tried to decipher cloud computing, I will hereby explain what the Mary Poppins is going on and how it’s going to impact IT management and specifically BTM over the next few years. I will start by saying that things are going to get more complex and significant challenges are ahead for vendors who are looking to provide next generation IT management software. There are several acronyms you need to understand as well, so I’ll get cracking:
Private Clouds – think of these as on-premise utility/grid computing with the virtualization of OS and application run-time environments across the enterprise. An example might be a grid of 500 J2EE servers which are virtualized and shared across hundreds of different applications within an enterprise.
Public Clouds – this is simply off-premise utility computing provided by a 3rd party vendor. For example, Amazon EC2 or Rackspace where businesses can buy computing resource on-demand which are accessed remotely across the internet (hence it being public).
SaaS – Software As A Service. Enterprise Applications that are hosted on the internet by a 3rd party vendor. For example, Salesforce.com, Success Factors or GoogleMail where businesses log into a website that provides them with specific services that aid their business.
PaaS – Platform As A Service. Application run-time platforms that are provided by 3rd party vendors across the internet, giving businesses the ability to build new applications using 3rd party frameworks or run-time environments. For example, Google App Engine or Salesforce.com’s AppExchange. Many businesses store their customer data within Salesforce.com; using AppExchange, they can build new applications on top of this data.
IaaS – Infrastructure As A Service. Essentially the same as Public clouds where businesses can buy servers or computing power on demand from a 3rd party hosting provider.
Hybrid Cloud – combination of all of the above.
Some of the above is probably common knowledge and I’m betting someone will comment on this blog telling me the above descriptions are not entirely accurate. The key problem with the above is that enterprise applications are going to become more fragmented and distributed across multiple deployment platforms which are not all controlled by the customer. To add to this, we’ve just had a decade of SOA projects which essentially increased the number of dependencies between applications, so when a user executes a business transaction these days it’s likely to pass through several application architectures. Why is this important? It multiplies the complexity of, and the demands on, IT management software, which up until now has struggled to monitor and manage single applications, let alone multiple connected applications. In summary, a blackbox application becomes a blackbox of blackboxes with multiple points of failure and dependencies. Visibility of how the business (transactions) executes across these blackboxes therefore becomes key to effectively managing the business and IT. Business Transaction Management solutions will be key to providing this much needed visibility across the many types of blackboxes, regardless of whether they’re in a data centre, in a cloud or being managed by a 3rd party vendor. You can only manage and control what you can see; as many enterprise applications move to the cloud, it’s critical that customers maintain their visibility of how their business executes across IT.
CEP doesn’t have to be complex
By Anonymous, January 2011.
One of my favourite sports is Formula 1. For the unfamiliar, it involves 22 cars racing flat out at over 200mph with drivers’ bums 2mm from the ground, with many of them crashing and going up in flames (see below). It differs from traditional Nascar racing in that it has these things called “corners” which make it more tricky for the drivers to overtake. Formula 1 is a big business, with many teams spending over £150 million a year to make their car faster than everyone else’s. It’s a global sport with significant sponsorship, TV revenue and an opportunity for car manufacturers to compete. To say business impact doesn’t occur in Formula 1 is pretty much the same as saying no-one gets hurt in boxing.
So how do these teams minimize business impact and make their cars finish races? Firstly, they have a lot of talented people whose job it is to design, develop, test and support these cars that cost £1.5 million each. Secondly, they are experts in monitoring and improving one important metric: performance. Each car has 2,500 metres of wiring and over 250 sensors which continuously monitor the performance of car components in real-time. The data from these sensors is often known as “telemetry”, which is fed into a computer and then analyzed by test or race engineers. Over a race distance millions of events are captured from each car and are used by the pit wall to help their cars finish the race. Engine temps, tyre temps, brake wear, hydraulic pressure, tyre pressures, brake temps, clutch wear – the list is endless. The job of the race engineers and their computers is to spot which events matter so they can take pro-active action (Complex Event Processing). They make definitive decisions to directly increase the performance and reliability of their car so it can finish the race as high as it possibly can. For example, if tyre pressures are low it could mean a number of things, from a simple slow puncture to a problem with the brakes which is causing tyre temps to drop, thus impacting tyre pressure. The last thing a Formula 1 team want to do is pit their car, so they need to process and analyse multiple events to make the right decision. Just as failing businesses go out of business, so do Formula 1 teams, as the recent departures of BMW, Toyota and Honda show.
A Formula 1 car must be fast and reliable for its team to be successful. The same principle can be applied to any business out there that has mission critical applications or business services. Slow performance and outages have a direct business impact. The only difference is that there is probably a lot more wiring (networks) and sensors (agents) used to monitor every angle of an application through the various OSI layers. Complex Event Processing engines add significant benefit in gaining meaningful real-time intelligence from the data that is collected. CEP allows monitoring solutions to become smarter with the data they collect and present; it also makes monitoring solutions aware of data from other sources that may explain why specific events are being observed. For example, if an application tier goes down the monitoring solution may throw an alert. However, if this was planned downtime or a change request then the tier outage is perfectly valid. With CEP capabilities it’s possible to build simple rules that prevent false positives and alert storms. For example, a CEP engine can process a tier outage event and then query the change management repository to see if downtime is planned; if not, it can then alert to say the tier has been verified down. This is just a very simple example of how a CEP engine can significantly enhance traditional IT monitoring solutions.
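The planned-downtime rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular CEP engine expresses rules: the `PlannedChange` and `TierOutage` record shapes, and the in-memory list standing in for the change management repository, are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class PlannedChange:
    """A planned-downtime window for one tier (hypothetical record shape)."""
    tier: str
    start: datetime
    end: datetime


@dataclass(frozen=True)
class TierOutage:
    """A 'tier down' event raised by the monitoring solution (hypothetical)."""
    tier: str
    at: datetime


def is_planned(event: TierOutage, changes: List[PlannedChange]) -> bool:
    """True if the outage falls inside a planned window for the same tier."""
    return any(
        c.tier == event.tier and c.start <= event.at <= c.end
        for c in changes
    )


def alerts_for(events: List[TierOutage], changes: List[PlannedChange]) -> List[str]:
    """Return alert messages only for outages with no matching planned change."""
    return [
        f"ALERT: tier {e.tier} verified down at {e.at.isoformat()}"
        for e in events
        if not is_planned(e, changes)
    ]
```

An outage on a tier during its own planned window produces no alert; the same outage an hour after the window closes, or on a different tier, does.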
In fact, the power of CEP is exactly why OpTier recently introduced its Business Events module (BEM), so our customers can gain better intelligence into what is impacting their business. In the same way we use the market leading Oracle database to persist our data, we use a market leading CEP engine to process events from the millions of business transactions we collect each day. For every business transaction captured we know which application, business process, user, location, tiers and protocols it touched, along with KPIs such as latency, resource and SLA for those respective entities. So if a user from an unauthorized IP subnet executes a business transaction we can detect it in real-time and notify the application security team. Again, just a simple example of how CEP capabilities can enhance Business Transaction Management.
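To make the subnet example concrete, here is a minimal sketch using Python’s standard `ipaddress` module. It says nothing about how BEM itself is implemented: the allow-list, the `client_ip` field and the dictionary shape of a transaction are all assumptions chosen for illustration.

```python
import ipaddress

# Hypothetical allow-list of subnets that users are expected to connect from.
AUTHORIZED_SUBNETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.10.0/24"),
]


def is_authorized(client_ip: str) -> bool:
    """True if the address belongs to at least one authorized subnet."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in AUTHORIZED_SUBNETS)


def unauthorized_transactions(transactions):
    """Keep only the transactions whose client IP is outside every subnet.

    Each transaction is assumed to be a dict with a 'client_ip' key.
    """
    return [t for t in transactions if not is_authorized(t["client_ip"])]
```

In a real deployment the result would be handed to whatever notifies the application security team; here it is simply returned to the caller.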
Does change management impact your infrastructure or your business?
I’ve witnessed a lot in IT over the last decade. I’ve seen a DBA mistakenly blow away (rm -rf) a live production database, thinking they were logged into a test server shell. I’ve seen websites go bang several hours before and even several minutes into major product launches. I’ve filled out many change requests in my time, with many of these processed by people who actually forgot to make the relevant changes despite signing off the change requests as completed. I’ve also seen many customers deploying applications into production based on configuration they used in test environments, with debug logging enabled. The best one recently was when a security guard accidentally locked themselves in a data center room and hit a button thinking it was the door release, when in actual fact it was the EPS power button, which knocked out the entire power to the data center. We can blame the rise of the machines for our IT woes, but the biggest liability by far is still us human beings.
Today, the only thing constant throughout the application lifecycle is change. Building an application is relatively cheap; supporting and maintaining it is where the costs start to spiral out of control. Change requests are an expensive activity: they require development, regression testing, documentation, planning, downtime, backup procedures and an eye for detail. However, when a change occurs, how many organisations can truly quantify the business impact?

What exactly changed?
For example, a DBA might look at the top 5 slowest SQL Statements that execute in the database. They might optimise these in several ways by creating a few indexes, updating relevant table statistics or tweaking I/O settings. Various change requests are then submitted which are then deployed in production. What the DBA doesn’t understand at the time is what impact their changes will have on the business. Their database could be serving multiple applications spanning hundreds of business transactions with thousands of users. Introducing a new index on one table might improve one SQL statement but it could have a detrimental effect on several other SQL statements which collectively could impact several key business transactions. It’s therefore virtually impossible to quantify whether changes like this will have a positive impact on the business.
Same goes for an application developer. I know because I’ve been there and tried to optimise many JVMs with APM tools in the past. I could spend all day knocking milliseconds off Java API calls or playing with container settings like connection pools or thread counts in a vain attempt to optimise the application sitting on top of the JVMs. You can find 101 interesting things a day to optimise with an APM tool. The trick is knowing which things will actually impact the business in the most positive way. It’s also good to know when to stop tuning – the more you change, the more you need to test. When you’re tweaking application code or changing container settings it’s not that easy to figure out what business transactions you’re playing with. Again, you might be tuning your JVMs to make them more efficient, but being able to truly understand the business impact of your actions is still a black art. If a dev team of 5 people spends 4 weeks tuning application code and only improves business transaction response time by 5%, did they really do a great job? Did the 5% improvement impact important business transactions or did it impact less important business transactions?
Another problem is knowing when to schedule a change request. Many applications these days are 24/7 and global. No longer can organisations rely on midnight change requests. You want to schedule change requests at times with the least business impact. How many users are logged on at this time? How many business transactions execute at this time? Are the business transactions important or can they suffer unavailability?
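As a rough illustration of answering “how many business transactions execute at this time?”, the sketch below counts captured transactions per hour of day and returns the quietest hours. The function name and the plain list of timestamps are assumptions for the example; it also treats every transaction as equally important, which a real scheduling decision would not.

```python
from collections import Counter
from datetime import datetime


def quietest_hours(timestamps, top_n=3):
    """Return the top_n hours of day (0-23) with the fewest transactions.

    Hours with no transactions at all count as zero. Ties are broken by
    the earlier hour.
    """
    counts = Counter({hour: 0 for hour in range(24)})
    for ts in timestamps:
        counts[ts.hour] += 1
    return sorted(range(24), key=lambda h: (counts[h], h))[:top_n]
```

Weighting each transaction by its business importance, rather than counting them all as one, would be the natural next step.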
Business Transaction Management solves a lot of these change management issues. When you capture all business transactions across all tiers all of the time you have full visibility into how each change request or tier impacts your business transactions and ultimately your business. You can also identify the best time to schedule changes based on business transaction activity. When Change Request #5463 was deployed it improved the SLA for several key business transactions by more than 25%. When Change Request #7653 was deployed it improved the response time of Execute Order by 80% but actually degraded the response time of Cancel Order and Check Customer by almost 350%. This is just a small sample of the benefits BTM can bring to change management.
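The before-and-after comparison behind figures like these can be sketched as follows. It is a simplified illustration only: the function name, the use of a plain mean, and the dictionaries of response-time samples keyed by transaction name are assumptions, not a description of how OpTier BTM computes its numbers.

```python
from statistics import mean


def response_time_change(before, after):
    """Percent change in mean response time per business transaction.

    `before` and `after` map a transaction name to a list of response
    times (in seconds) sampled before and after a change was deployed.
    Negative values mean the transaction got faster. Transactions that
    are missing from either side, or have a zero baseline, are skipped.
    """
    report = {}
    for name in before.keys() & after.keys():
        baseline = mean(before[name])
        if baseline == 0:
            continue
        report[name] = (mean(after[name]) - baseline) / baseline * 100.0
    return report
```

With a baseline of 1.0s dropping to 0.2s, Execute Order would show roughly -80%, while Cancel Order going from 0.2s to 0.9s would show roughly +350%: one change request, two very different business outcomes.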






