X marks the spot. In the case of IT this loosely translates into the moment when an exclamation of “WTF!?!” is emitted by a seasoned developer. It’s that same moment when they stumble across the one problem that is robbing your system blind of decent performance.
Once the blueness has cleared from the air, it’s down to business to resolve the problem at hand. Which IS a lot simpler than it sounds.
Ah, grasshopper, how did we get to this Eureka moment? Sheer graft, I am afraid. There is no one silver bullet that you can fire into a system that will reveal the ugly truth. It takes patience. It takes planning. And it takes a long, long time … no wait! It doesn’t need to.
So the more cynical of you out there in internetland will be scoffing at this point. Pah, you can spend YEARS and never find the real cause. Better to throw the whole thing away and start again, I hear you say.
This view does have merit. Sometimes. If you have a tolerant and rich sponsor who is willing to wait a year or so for you to come up with exactly the same software that’s slightly faster than the current one. IMHO this choice should only be seriously considered when the rewritten application is going to add appreciable functionality to boot. Otherwise the cost-benefit ratio doesn’t look too good.
Our experience has been that to achieve that Moment of Clarity (© uptonconsulting 2010, heh), you need to benchmark the entire system.
What?
Benchmark. But not just in terms of one operational parameter, such as response time, but for ALL aspects of the application: throughput, response time, latency, volume, parallelism and so on. There are many texts on what to measure, but few suggest how to measure them usefully.
OK, so now what? Simple. Don’t think like a geek. Really. Don’t get lost in the bits and bytes of individual call stacks. Don’t spend days grovelling around like a mole tracking message flows between log files. Take a systemic view of the application from a business perspective and measure that scientifically.
- keep the system in the same configuration as Production
- ensure that the test environment is as close to production as possible.
- have realistic input load injectors (see below)
- make sure that you can measure the things you want to measure, easily. Get the right tools. A profiler is a very handy thing to have in the toolbox.
- keep the benchmarking exercise as simple as possible. Resist changing data loads on the fly.
- make sure you can reproduce your measurements on demand. No reproducibility == No benchmark
If you cannot set up a consistent and predictable load on the app on demand you will be wasting your time. Have a good look at your load test harness. Get the load right and reproducible, and all other factors will tend to fall into line.
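To make that concrete, a load injector doesn’t have to be a grand enterprise tool. Here is a minimal sketch in Python (the endpoint, rate, duration and thread counts are illustrative assumptions, not a recommendation) that fires a fixed, seeded schedule of requests so every run applies exactly the same load:

```
import random
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# All names and values below are illustrative assumptions.
TARGET_URL = "http://test-env.example.com/quote"   # hypothetical endpoint
REQUESTS_PER_SECOND = 50
DURATION_SECONDS = 60
SEED = 42                                          # fixed seed => identical schedule every run

def build_schedule():
    """Pre-compute request offsets so every run fires the identical load."""
    rng = random.Random(SEED)
    return sorted(rng.uniform(0, DURATION_SECONDS)
                  for _ in range(REQUESTS_PER_SECOND * DURATION_SECONDS))

def fire(offset, start):
    # Wait until this request's slot, then hit the endpoint and record latency.
    time.sleep(max(0.0, start + offset - time.time()))
    t0 = time.time()
    try:
        urllib.request.urlopen(TARGET_URL, timeout=10).read()
        return time.time() - t0
    except Exception:
        return None   # a real harness would count failures separately

if __name__ == "__main__":
    schedule = build_schedule()
    start = time.time()
    with ThreadPoolExecutor(max_workers=100) as pool:
        latencies = [r for r in pool.map(lambda o: fire(o, start), schedule) if r is not None]
    print(f"sent={len(schedule)} ok={len(latencies)} "
          f"mean={sum(latencies) / max(len(latencies), 1):.3f}s")
```

The whole point is the fixed seed: run it again tomorrow and the load profile is identical, which is what makes your measurements comparable.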
What do we measure? Load up the system with the same stress as is seen in production. Better yet, replay prod data through your prod-like test environment. This means that you have captured your inputs in such a way that you can replay them.
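Capturing inputs doesn’t need to be clever either. Log each inbound message with its arrival time, then push the same messages back in with the same gaps. A hedged sketch, assuming the capture is a plain file of tab-separated timestamp/payload lines and a hypothetical send_to_test_env() hook into your test environment:

```
import time

def replay(capture_file, send):
    """Replay captured prod inputs with the original inter-arrival gaps.

    Assumes each line is '<epoch_seconds>\t<payload>'; 'send' is whatever
    pushes a payload into the test environment (hypothetical hook).
    """
    prev_ts = None
    with open(capture_file) as f:
        for line in f:
            ts_str, payload = line.rstrip("\n").split("\t", 1)
            ts = float(ts_str)
            if prev_ts is not None:
                time.sleep(max(0.0, ts - prev_ts))   # preserve the original pacing
            send(payload)
            prev_ts = ts

# Example: replay("prod_capture.tsv", send_to_test_env)
```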
As an aside, most systems experience three operational states:
- idle – the world is asleep and therefore so is the app
- steady state – the application is ticking over with a fairly predictable input load
- high load – the application is being stressed by an external event that causes a flood of input. Typically market opening times, approx 5-10% of the app’s uptime.
Make sure that you can provoke your system into each one of these three states on demand. Reproducibly.
One thing here. If you have a system that spends more than 5-10% of its time under stress, ask yourself: What has changed? New inputs? Is someone doing a DoS on you upstream? Has someone in your team mucked up the inter-component message flow, causing a message storm? The world may have grown faster than your original capacity plan anticipated. You might be heading back to the drawing board.
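If you already collect an input-rate metric it takes seconds to check how much of the day the app actually spends under stress. A trivial sketch, assuming a list of (timestamp, rate) samples and an illustrative threshold taken from your capacity plan:

```
def stress_fraction(samples, high_load_threshold):
    """Fraction of samples at or above the high-load threshold.

    'samples' is a list of (timestamp, input_rate) pairs; the threshold is
    whatever your capacity plan calls 'high load' (illustrative here).
    """
    if not samples:
        return 0.0
    stressed = sum(1 for _, rate in samples if rate >= high_load_threshold)
    return stressed / len(samples)

# e.g. stress_fraction(samples, 5000) > 0.10 suggests the world has outgrown the plan
```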
Once you can load up as you like, my preference is to look at the blindingly obvious.
- CPU thrashing on all machines acting as hosts – that includes the DB
- Memory footprints – again, everywhere
CPU thrashing indicates poor algorithmic execution i.e. Muppet programming. An execution pathway is doing something far more often than it should. Deep memory footprints point to bad memory management such as crap caching, poor object destruction, badly managed collections. This can manifest in many ways. More on that later.
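Watching CPU and memory at this blindingly-obvious level needs nothing fancier than a periodic sample on every host, DB box included. A sketch using the third-party psutil package (an assumption on my part – any OS-level sampler such as vmstat does the same job):

```
import psutil   # third-party; assumed available on every host being watched

def sample(interval=5, samples=12):
    """Print host-wide CPU and memory usage every 'interval' seconds."""
    for _ in range(samples):
        cpu = psutil.cpu_percent(interval=interval)   # averaged over the interval
        mem = psutil.virtual_memory()
        print(f"cpu={cpu:5.1f}%  mem_used={mem.used / 2**30:6.2f} GiB "
              f"({mem.percent:.1f}%)")

if __name__ == "__main__":
    sample()
```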
It’s not so much the fact that you witness these symptoms, but rather WHERE you see them. If you think about it, DB operations are THE MOST EXPENSIVE operations an app can make. They involve IO, sockets, net, more sockets and yet more IO. So make sure you have a good hard look at what your DB is doing. If your app is making more calls to the DB than it should, as sure as a Cox’s Orange is an apple, you are burning loads of time for no reason. Too many DB calls show up as high CPU on the DB and can also contribute to bulging memory allocations in the app. Notice I didn’t say tune the DB calls? Only do that when you know you are making the RIGHT ones every time.
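A cheap way to find out whether you are making the RIGHT number of calls is to count them per business operation before you tune anything. A hedged sketch wrapping a DB-API cursor – the connection and the price_order() operation are purely illustrative:

```
class CountingCursor:
    """Thin wrapper over a DB-API cursor that counts execute() calls."""

    def __init__(self, cursor):
        self._cursor = cursor
        self.calls = 0

    def execute(self, sql, params=None):
        self.calls += 1
        return self._cursor.execute(sql, params or ())

    def __getattr__(self, name):
        return getattr(self._cursor, name)   # delegate everything else

# Illustrative use: wrap the cursor for one business operation and compare the
# count against what that operation *should* need (often a single-digit number).
# cur = CountingCursor(connection.cursor())
# price_order(cur, order)          # hypothetical business operation
# print(f"DB calls for one order: {cur.calls}")
```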
If you have just CPU problems in the app, profile the execution pathways. Someone will have put a dumb recursive call in somewhere that is having a hard time breaking out. Or you could be making too many DB calls…
If you have just memory problems in the app, hunt down the cause using a profiler. If you have them on the DB host, grab a DBA by the throat and don’t let go until they have given you stats on the calls creating the biggest dent. Likewise, if you see memory issues in a 3rd party resource you are reliant upon – queue managers, application servers – get evil on the vendor’s after-sales support until you see what is causing the problem.
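What you are squeezing out of that DBA is, roughly, “which statements burn the most time”. On PostgreSQL, for instance, the pg_stat_statements extension answers that directly (an assumption – your DB will have its own equivalent, e.g. Oracle’s AWR reports); a sketch pulling the top offenders via psycopg2:

```
import psycopg2   # third-party; assumes Postgres with pg_stat_statements enabled

TOP_OFFENDERS = """
    SELECT query, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements           -- 'total_time'/'mean_time' on older versions
    ORDER BY total_exec_time DESC
    LIMIT 10
"""

def biggest_dents(dsn):
    """Return the ten statements consuming the most DB time."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(TOP_OFFENDERS)
        return cur.fetchall()

# e.g. for query, calls, total_ms, mean_ms in biggest_dents("dbname=app"): print(query, total_ms)
```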
If you get to this point and you haven’t seen CPU twanging or memory bashing then it’s fairly obvious what to do next. Check your code for deliberate waits. Profile for deadlocks, long waits and slow IO. Something is gunging up the system and it is most likely YOU. Lastly, check if your environment sags under load. Net IO is a favourite. MOM resources can also get fragile if they are abused by a high volume of heavy messages.
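The “deliberate waits” check, at least, can be semi-automated: a crude sketch that greps the source tree for hard-coded sleeps (the file extensions and patterns are illustrative, and it will throw up false positives):

```
import pathlib
import re

# Illustrative patterns only; extend for whatever languages your system uses.
WAIT_PATTERNS = re.compile(r"Thread\.sleep|time\.sleep|usleep\(|WAITFOR DELAY", re.I)

def find_deliberate_waits(root):
    """List source lines that look like hard-coded waits."""
    hits = []
    for path in pathlib.Path(root).rglob("*"):
        if path.suffix not in {".py", ".java", ".cs", ".sql"}:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if WAIT_PATTERNS.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits

# e.g. for f, n, text in find_deliberate_waits("src"): print(f"{f}:{n}: {text}")
```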
Other things to look at, but to avoid getting hung up over:
- execution pathways. You can only look at one of these statically when you read/debug code. Let the profiler take the strain and work out where the CPU/memory is/isn’t going
- MOM queue depths. They signal a problem somewhere else in your system. They are not the cause.
- MOM queue residency times. These are symptomatic of a sluggish message consumer; they are not the cause of a slow system
- service times in each execution pathway. A problematic pathway will show up in the Profiler. Don’t waste your time digging timestamps out of logs. All this will do is a) confirm the profiler’s view on the app and b) waste loads of time. The business view has a perception of “fast” or “slow”. Your system is “slow” otherwise you wouldn’t be doing this work, would you? Don’t make a fool of yourself by confirming what is already noted by your sponsors.
Guaranteed you will find something by the time you work your way to here. Remember, think about what you are seeing. Resist the urge to tweak something if you are not certain beyond reasonable doubt that the thing to be tweaked is in fact the cause. Lots of systems get buggered exactly in this way.