Finding Issues

How do you find the performance problem?

Page Contents
Introduction
Example 1: Increasing load causes 302s
Example 2: Simple capacity issue
Example 3: Memory leak
Example 4: Losing records
Example 5: Concurrent users functional issues

Introduction

Everything else on this site is about tools, configuration, automation and so on. But at the end of the day that is all basically peripheral. Yes, it's all very useful and even required for a performance tester's day-to-day work, but the job itself is to find performance issues and risks, and to fix them or at least mitigate them.

So how do you actually do the job?

In an interview, a company will probably ask you some simple coding questions, perhaps ask what certain terms mean (GC stats, average load and so on). They may give you a test on the tool set: record a script, then correlate and parameterise it. And they may go into all sorts of technical discussions. But again, that's all rather 'clinical', 'textbook' stuff.

At the end of the day, here's the key to finding and fixing problems:

    Relationships with devs, DBAs and the geeky(!) dev ops (sorry Brendan!)

Plus:

  1. Getting your models right in the first place (you typically need to match call ratios and cache ratios within your data sets; see the sketch after this list). Watch out for databases: if you've just populated them, the indexes are likely to be all nice and neat and not at all like production
  2. Being forceful about running the tests you know you have to run (stress tests and long-running soak tests). Don't let anything go out if you have not covered everything, or at least get the detail written into the sign-off
  3. Run more tests!
  4. Results analysis: you really do need to look inside the application logs, so you need access to them. You can't be expected to do performance analysis if you're blind to half the system
  5. If you suspect an issue, you need to chase up the teams and get investigations going. Watch out, it's not uncommon to get this: devs: "no, the code hasn't changed"; dev ops: "no, the configuration hasn't changed"; DBAs: well, to be fair they tend to send you reams of SQL CPU times in an email... just try to decipher it, the clue could be in there(!); network: "we can't see any reason for those bottlenecks!" Oh dear...
  6. Don't worry though, you can glean quite a lot just from the test tools. They certainly are useful; they are designed for the job. I'll give some examples below
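On the subject of getting the model right (point 1 above), the call ratios usually come from production traffic. As a rough sketch only, assuming a standard combined-format access log and a hypothetical access.log file name, something like this gives the weighting for each URL; the percentages then become the transaction mix in the load profile:

#!/usr/bin/env python3
"""Rough sketch: derive call ratios for a workload model from an access log.

Assumes a combined/common log format where the request line is the quoted
field ("GET /some/path HTTP/1.1"). The file name access.log is just an
example - point it at whatever your web servers actually produce.
"""
from collections import Counter
import re
import sys

REQUEST_RE = re.compile(r'"(?:GET|POST|PUT|DELETE|HEAD) ([^ "?]+)')

def call_ratios(log_path):
    counts = Counter()
    with open(log_path, errors="replace") as f:
        for line in f:
            m = REQUEST_RE.search(line)
            if m:
                counts[m.group(1)] += 1
    total = sum(counts.values()) or 1
    return [(path, n, 100.0 * n / total) for path, n in counts.most_common()]

if __name__ == "__main__":
    log = sys.argv[1] if len(sys.argv) > 1 else "access.log"
    for path, n, pct in call_ratios(log)[:20]:
        print(f"{pct:6.2f}%  {n:8d}  {path}")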

Example 1: Increasing load causes 302s

This test was run using JMeter in the cloud. We ramped up the load and started to see 302s. This was unusual, but decreasing the load reduced the 302s, and at a certain low load we didn't get them at all. Checks were made to show this was reproducible.

[image00108: JMeter results showing 302 responses appearing as the load is ramped up]

Further investigation showed 302s in the production logs, where the same URLs sometimes returned 200s and sometimes 302s. My investigations suggested this was a live, load-related issue:

[image00206: production log analysis showing the same URLs returning both 200s and 302s]

This was one of those cases where the devs and the dev ops both said nothing was wrong and it was probably the testing that was at fault!

I did further analysis of the Production logs, showing that although a lot of 302s were due to robots and spiders (which do not log in and may have access problems), a large number of them were from real users experiencing real problems.
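That analysis amounted to splitting the 302s by user agent. A minimal sketch of the idea, assuming a combined log format (status code after the quoted request, user agent as the last quoted field) and with the bot patterns as examples only:

#!/usr/bin/env python3
"""Sketch: split 302 responses into bot traffic vs likely real users."""
import re
import sys
from collections import Counter

# Combined log format: ... "GET /path HTTP/1.1" 302 bytes "referer" "user-agent"
LINE_RE = re.compile(r'"[A-Z]+ ([^ "]+)[^"]*" (\d{3}) .*"([^"]*)"$')
BOT_RE = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def classify_302s(log_path):
    bots, humans = Counter(), Counter()
    with open(log_path, errors="replace") as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m or m.group(2) != "302":
                continue
            path, _, agent = m.groups()
            (bots if BOT_RE.search(agent) else humans)[path] += 1
    return bots, humans

if __name__ == "__main__":
    bots, humans = classify_302s(sys.argv[1] if len(sys.argv) > 1 else "access.log")
    print(f"302s from robots/spiders: {sum(bots.values())}")
    print(f"302s from real users:     {sum(humans.values())}")
    for path, n in humans.most_common(10):
        print(f"  {n:6d}  {path}")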

The issue was somewhat brushed to one side, though not exactly ignored. This was for a new release, and I could show that the latest code was no worse than the previous code, so in context we could go live. But I did make my point repeatedly at stand-ups, so the team knew of the problem and that I believed we currently had a capacity issue in Production.

A few weeks went by and one of the functional testers noticed something. There was an issue around dates, with cookies being created slightly out of sync with the back end servers. Ha ha! I hear you say! Well, it turned out the out of sync dates in the cookies were enough to throw errors on some servers and thus give the user a 302. It got worse under load. After some investigation, the performance issue was diagnosed.

Further performance testing was set up, and it showed that my results were correct: there was an issue in Production, and the performance testing had indeed highlighted a serious issue affecting users. But it wasn't down to capacity; it was due to a bug in the code. This was odd, as the error thrown was not immediately obvious, but that is sometimes the case.

After the code fix, we were able to ramp up the load 10 times more than before and no 302s were seen. This was a successful outcome.

The lesson here is to stick to your guns. I could so easily have brushed off the results as some strange test tool issue; there was no obvious capacity issue in Prod, and 302s were expected from robots, so perhaps that was all that was going on. But I made the situation clear from my perspective, kept the evidence active within the team, and signed off the release only with caveats.

Example 2: Simple capacity issue

This is a basic scenario in performance testing but worth showing here. We ramp up the load continually until the application can't service the requests. This gives us the capacity of (in this case) one node:

[image00305: response times and errors as the load is ramped up to the capacity of a single node]
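The real test was a JMeter ramp-up, but purely to illustrate the idea, here is a rough sketch (hypothetical TARGET_URL, step sizes and durations) of stepping up concurrency until the errors and response times show the node is saturated:

#!/usr/bin/env python3
"""Sketch: step up concurrency until error rate / latency shows saturation.

TARGET_URL, the step sizes and the durations are all made up for
illustration - a real capacity test would use JMeter/LoadRunner profiles.
"""
import threading
import time
import urllib.error
import urllib.request

TARGET_URL = "http://test-node.example.com/"   # hypothetical single node
STEPS = [5, 10, 20, 40, 80]                    # concurrent users per step
STEP_SECONDS = 60

def worker(stop, results, lock):
    while not stop.is_set():
        start = time.time()
        ok = True
        try:
            urllib.request.urlopen(TARGET_URL, timeout=10).read()
        except (urllib.error.URLError, OSError):
            ok = False
        with lock:
            results.append((time.time() - start, ok))

for users in STEPS:
    stop, results, lock = threading.Event(), [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(stop, results, lock))
               for _ in range(users)]
    for t in threads:
        t.start()
    time.sleep(STEP_SECONDS)
    stop.set()
    for t in threads:
        t.join()
    times = [r[0] for r in results if r[1]]
    errors = sum(1 for r in results if not r[1])
    avg = sum(times) / len(times) if times else float("nan")
    print(f"{users:3d} users: {len(results)} requests, "
          f"{errors} errors, avg {avg:.2f}s")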

Example 3: Memory leak

As usual, a soak test was run over the weekend; this is highly recommended as standard practice. Load levels were normal (as expected in Live) and the test ran for hour after hour. But after 20 hours or so it all fell over. Initially it was thought the environment had been disrupted, but after a second run with the same result (after the same time span), and with the help of dev ops and dev, the issue was found. It was not immediately obvious even which logs to look in, but we saw this in the kernel logs on one of the app servers:

[image00406: kernel log extract from one of the app servers at the point of failure]

Finding these sorts of issues in detail (the actual area of code causing the problem) really does need developer input. The job of the performance tester here is to provide a reproducible way of forcing the issue. To this end I was able to ramp up my testing significantly so we could force the failure within about an hour and a half. Luckily we had Dynatrace here. I do have access to it myself but, to be honest, these tools really do need specific application developer input: you need to know the code in detail to find and fix the problem.

With Dynatrace running I could spark up my tests. I was also monitoring server memory directly (with top) and could see the issue building during the test.

The performance tests could now reproduce the issue after 1 hour 23 minutes. Notice that the Eggplant graphs wrap around, like PerfMon:

[image00505: Eggplant test graphs at the point the issue is reproduced]

From the front end I can see the test falling over at this point:

[image00604: front-end test results showing failures as the app server runs out of memory]

And from the app server I can see us running out of memory (this screenshot is actually from earlier, before complete failure): the java process is currently taking 87.5% of memory on this app server and has been climbing steadily. In fact this is not as bad as it looks; you really need to look inside the GC stats. Java will grab system memory and keep hold of it, so this system-level view does not show the whole picture and can lead to unnecessary concern. If memory is ultimately kept under control, it is not a problem. The GC monitoring, however, showed no GCing and a steady internal increase, so in this case the app did fall over and the system memory did go beyond expected levels. I don't have all the details to hand in this case, so here is what I do have:

[image00704: memory monitoring from the app server showing the java process climbing steadily]
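The top-style monitoring can also be scripted. As a rough aside only (hypothetical PID, interval and sample count), this sketch samples a process's resident memory with ps and reports the net growth; as noted above, it only gives the system-level view, and the GC stats are still what confirm a leak:

#!/usr/bin/env python3
"""Sketch: sample a process's RSS over time to spot a steady climb.

The PID, interval and sample count are illustrative. This is the same
system-level view as top; a real diagnosis still needs the GC stats.
"""
import subprocess
import sys
import time

def rss_kb(pid):
    # ps -o rss= prints the resident set size in kilobytes on Linux
    out = subprocess.run(["ps", "-o", "rss=", "-p", str(pid)],
                         capture_output=True, text=True)
    return int(out.stdout.strip()) if out.stdout.strip() else None

def watch(pid, interval_s=60, samples=30):
    history = []
    for _ in range(samples):
        kb = rss_kb(pid)
        if kb is None:
            print("process gone")
            return
        history.append(kb)
        print(f"{time.strftime('%H:%M:%S')}  RSS {kb/1024:8.1f} MB")
        time.sleep(interval_s)
    growth = history[-1] - history[0]
    print(f"Net growth over run: {growth/1024:.1f} MB")

if __name__ == "__main__":
    watch(int(sys.argv[1]))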

After lots of developer input it was found that a particular 3rd party component to do with on-the-fly image re-sizing was the problem, and the internet showed this had been fixed in the latest version of Java. Of course, it is not always straightforward to just upgrade Java on a project, but we couldn't go live with a known memory leak (surely?). No. So it was decided to update Java for this project only (not typically good practice, for all sorts of reasons, to have different apps using different versions).

Functional regression testing and extensive performance testing had to follow the update, but ultimately the issue was found and fixed, so this was a successful result.

Example 4: Losing records

Typically with performance tests we run at high loads and make hundreds of thousands of calls. We test the responses, hopefully always including some specific (functional) check so we know exactly when we get failures.

But sometimes we are calling more complex applications and they in turn create records or send data to third parties etc.

In this example, I was registering users in our system. They were being stored with a third party, and then another third party had to pick up any updates from the first and send emails out(!)

So I ran a typical registration profile against our systems:

[image00110: registration load profile run against our systems]

I then ran a DB retrieve script against the first third party's database to get a count of the records produced. This was more accurate than relying on my own counts from the test tools, which mattered here because there was no allowance for error. I was using Eggplant's in-memory DB to store my counts, but it wasn't quite accurate enough for this purpose; I needed the definitive number that had actually made it into the DB. The DB script results had to be filtered:

[image00209: filtered DB query results giving the count of records that actually reached the third party's database]
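The shape of that check is simple. Purely as an illustration (hypothetical table, columns, time window and a sqlite3 stand-in for the real connection), it boils down to counting what actually reached the database and comparing it with the tool's own count:

#!/usr/bin/env python3
"""Sketch: compare records that reached the DB with the test tool's count.

Table/column names, the time window and the sqlite3 stand-in connection are
all hypothetical - the real check ran against the third party's database
with their own filtering applied.
"""
import sqlite3

TEST_START = "2019-06-01 09:00:00"   # illustrative test window
TEST_END   = "2019-06-01 17:00:00"
TOOL_COUNT = 250_000                 # registrations the test tool believes it sent

def db_count(conn):
    cur = conn.execute(
        "SELECT COUNT(*) FROM registrations "
        "WHERE created_at BETWEEN ? AND ? AND source = 'perf_test'",
        (TEST_START, TEST_END),
    )
    return cur.fetchone()[0]

if __name__ == "__main__":
    conn = sqlite3.connect("thirdparty.db")   # stand-in for the real connection
    actual = db_count(conn)
    print(f"Tool count: {TOOL_COUNT}, DB count: {actual}, "
          f"missing: {TOOL_COUNT - actual}")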

The second third party then had to give me an independent report on how many records they had processed; if the numbers lined up, all was well. In fact this was a long-standing project with a lot of difficult requirements, and the more difficult issues were actually found by tunnelling each third party's calls through our own bespoke proxies so we could log exactly what was going on. Calls were being lost that could not be traced without this step, as communication between all three parties was not fully open. (A good tip there for projects involving several guarded stakeholders.)
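Purely to illustrate the idea of those bespoke proxies (not the actual implementation, and with a hypothetical upstream host and listen port), a minimal logging HTTP forwarder looks something like this:

#!/usr/bin/env python3
"""Sketch: a minimal logging HTTP proxy sat between two parties.

UPSTREAM and the listen port are hypothetical. Each request/response pair
is logged so lost calls can be traced. The real proxies were bespoke and
handled far more (TLS, retries, full payload capture).
"""
import http.client
import logging
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

UPSTREAM = "thirdparty.example.com"   # hypothetical destination host
LISTEN_PORT = 8888

logging.basicConfig(filename="proxy.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

class LoggingProxy(BaseHTTPRequestHandler):
    def _forward(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else None

        conn = http.client.HTTPConnection(UPSTREAM, timeout=30)
        conn.request(self.command, self.path, body=body,
                     headers={k: v for k, v in self.headers.items()
                              if k.lower() != "host"})
        resp = conn.getresponse()
        data = resp.read()

        logging.info("%s %s -> %s (%d bytes in, %d bytes out)",
                     self.command, self.path, resp.status, length, len(data))

        self.send_response(resp.status)
        for k, v in resp.getheaders():
            if k.lower() not in ("transfer-encoding", "connection"):
                self.send_header(k, v)
        self.end_headers()
        self.wfile.write(data)

    do_GET = do_POST = do_PUT = do_DELETE = _forward

if __name__ == "__main__":
    ThreadingHTTPServer(("", LISTEN_PORT), LoggingProxy).serve_forever()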

Example 5: Concurrent users functional issues

We had a system with two sides to it, two different clients.

  1. Logged-in web users would edit certain complex preferences, adding and deleting items.
  2. This would happen at high load.
  3. The other side was a pick-up script within a 3rd party application that would process the changes every 15 minutes.

It was clear that the web-facing system would need performance testing under high load with realistic user data. This was done and passed. The other side had been developed separately and tested functionally with large numbers of events to process, and it could process them well within the 15-minute window it had available.

As with all IT systems, a call had to be made (regarding time, money and risk) about whether we needed to do a full end-to-end test, with real users and data hitting the web side of the app AND the 15-minute process running in the background as per Prod. This was not a small task, as data and environments needed building and the two apps had to be brought together just for this testing, separately from all the other testing that had been done and separate from our usual staging environment. It was a close call, but we decided to run the test.

And this is where we saw an unexpected issue. Whilst real (simulated) users were both deleting and adding events, it turned out we had a bug, initially thought to be around the time stamps of events. It turned out to be more convoluted than that, and some fundamental changes were needed in the code to cope with concurrent usage of the app, with users deleting and adding events whilst the other side processed them (the sketch below shows the shape of that scenario).

This was NOT actually a load issue, but performance testing is often the only time an application is tested with concurrent users before go-live, and this was a functional bug in exactly that scenario. It would not have been picked up if that testing decision, made for all the right reasons, had gone the other way.

There is a definite lesson to be learned here. Be careful about thinking too rigidly about testing needs, and watch out for scenarios that could be missed even with all the different testing approaches you do employ. We were lucky in this case and the issue was fixed before it went live. I have made this point to the teams, and we shall endeavour to stay alert to these possible issues in future.
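To show the shape of that scenario only (everything here is hypothetical, with an in-memory store standing in for the real application and the 15-minute pick-up shrunk to seconds), a sketch in which simulated users add and delete events while a periodic sweep processes them, then checks that every event was accounted for exactly once:

#!/usr/bin/env python3
"""Sketch: concurrent add/delete of events while a periodic sweep processes them.

All names and timings are illustrative. The point is the shape of the test:
mutate and sweep at the same time, then reconcile the counts afterwards.
"""
import random
import threading
import time
import uuid

pending = {}                       # event_id -> created timestamp
added, deleted, processed = set(), set(), set()
lock = threading.Lock()
stop = threading.Event()

def simulated_user():
    while not stop.is_set():
        with lock:
            if pending and random.random() < 0.3:
                event_id = random.choice(list(pending))
                pending.pop(event_id, None)
                deleted.add(event_id)              # user deletes an unprocessed event
            else:
                event_id = str(uuid.uuid4())
                pending[event_id] = time.time()
                added.add(event_id)                # user adds a new event
        time.sleep(random.uniform(0.001, 0.01))

def batch_sweep(interval_s=2):
    while not stop.is_set():
        time.sleep(interval_s)
        with lock:
            batch = list(pending)                  # snapshot, as a real sweep might take
        for event_id in batch:
            time.sleep(0.001)                      # pretend processing takes time
            with lock:
                if event_id in pending:            # it may have been deleted meanwhile
                    pending.pop(event_id)
                    processed.add(event_id)

if __name__ == "__main__":
    threads = [threading.Thread(target=simulated_user) for _ in range(10)]
    threads.append(threading.Thread(target=batch_sweep))
    for t in threads:
        t.start()
    time.sleep(20)
    stop.set()
    for t in threads:
        t.join()
    unaccounted = added - deleted - processed - set(pending)
    twice = deleted & processed
    print(f"added {len(added)}, deleted {len(deleted)}, processed {len(processed)}, "
          f"still pending {len(pending)}, unaccounted {len(unaccounted)}, "
          f"both deleted and processed {len(twice)}")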

 
