Tuesday, September 23, 2014

Leveraging PerfMon to monitor application / server health

Here I would like to discuss a business scenario and the solution we provided. Sorry to say that this post contains little code.

I am not saying that this is the ultimate solution. Criticism is welcome so that I can improve it.

Background

We have a generic queuing system in our project to offload long-running processes. The web layer, which includes the web site and web services, just puts the request in the queue so that it can return the response within 2 seconds, leaving the processing machines to handle the request later. The processing is done in a pull fashion rather than a push fashion. Each processing machine (a Windows 2008 virtual machine) knows its processing capacity and how many queue messages are currently running on it. Based on its currently available capacity, it dequeues requests / messages from the queue and processes them. We developed a proprietary algorithm for dequeuing messages that also takes other factors into account, such as the priority of an individual message, the message type, etc.
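To make the pull model concrete, here is a minimal sketch in Python. It is not our proprietary algorithm: it simply orders messages by priority, which is an assumption for illustration, and the names `Message`, `Worker`, `pull` and `complete` are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Message:
    priority: int                     # assumption: lower number is dequeued first
    seq: int                          # tie-breaker so equal priorities keep arrival order
    body: str = field(compare=False)  # payload is ignored when ordering


class Worker:
    """One processing machine that pulls work only when it has free capacity."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_progress = []

    def free_slots(self):
        return self.max_concurrent - len(self.in_progress)

    def pull(self, queue):
        """Take at most free_slots() messages from the shared priority queue."""
        taken = []
        while queue and self.free_slots() > 0:
            msg = heapq.heappop(queue)
            self.in_progress.append(msg)
            taken.append(msg)
        return taken

    def complete(self, msg):
        """Mark a message as finished, which frees one slot."""
        self.in_progress.remove(msg)


# The web layer only enqueues; it never pushes work to a machine.
queue = []
for seq, (priority, body) in enumerate([(5, "report"), (1, "payment"), (3, "email")]):
    heapq.heappush(queue, Message(priority, seq, body))

worker = Worker(max_concurrent=2)
first_batch = [m.body for m in worker.pull(queue)]
print(first_batch)  # ['payment', 'email']
```

The point of the sketch is that the machine decides how much to take. A second call to `pull` returns nothing until `complete` frees a slot, so a busy machine never gets more than its configured maximum.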

There is a web portal to monitor this environment. It shows how many messages are waiting in the queue, which messages are in progress, how many have failed, etc.

I would say it's a cloud setup, where we can add more processing servers to scale out based on the processing power required. But I don't know why nobody else in our project calls it cloud :( . Maybe those poor guys don't know what cloud is? Or they never got a chance to experiment with a commercial cloud and see how it scales :)

The problem here is to provide a way to visualize this queuing system in correlation with system performance. In other words, the admins of this environment need to see the processor utilization on each processing machine while it is processing its maximum number of messages. Based on that, they can decide whether the currently configured "maximum number of messages which can be processed concurrently" really utilizes the full system resources. If the configured number of concurrent requests is higher than the machine's capacity, i.e. the machine is over-utilized, we can expect more failures (due to timeouts or memory problems) on the processing servers. If it is lower, the system will be underutilized.
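The decision the admins have to make can be sketched as a small Python function. This is only an illustration: the function name `assess_limit`, the sample data, and the 50% / 90% CPU thresholds are all assumptions, not values from our environment.

```python
from statistics import mean


def assess_limit(samples, max_concurrent, low_cpu=50.0, high_cpu=90.0):
    """Judge the configured limit from (in_progress_count, cpu_percent) samples.

    Only samples taken while the machine was running its maximum number of
    messages are considered. The thresholds are illustrative assumptions.
    """
    at_capacity = [cpu for count, cpu in samples if count >= max_concurrent]
    if not at_capacity:
        return "unknown"          # the machine never reached its limit
    avg = mean(at_capacity)
    if avg > high_cpu:
        return "over-utilized"    # expect timeouts or memory problems
    if avg < low_cpu:
        return "under-utilized"   # the limit could probably be raised
    return "ok"


# Hypothetical samples: (messages in progress, processor utilization in %)
samples = [(2, 20.0), (4, 35.0), (4, 40.0), (3, 30.0)]
verdict = assess_limit(samples, max_concurrent=4)
print(verdict)  # under-utilized
```

In practice the admins would make this call by eye from a graph rather than from a function, but the rule is the same: look at the processor only during the periods when the machine is at its configured maximum.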

The solution

Approach 1

As I mentioned earlier, there is a web portal which displays details about the queue system. The initial suggestion was to build a mechanism that captures processor and memory snapshots from all the processing servers and displays them in the web portal using third-party graph libraries.

Pros

  • Full control over data collection and display

Cons

  • Need a license for the third-party library. We could use open source, but the client doesn't allow that
  • More development effort

Is this the best solution?

Approach 2

If we think about alternatives, we can come up with 3-4 of them. The one which got the most support was "Using perfmon to show the queue details".

The solution is simple. Microsoft has already invested plenty in creating a nice graph-based UI for monitoring system performance. It's called Perfmon and it is bundled with the Windows operating system. We can create our own performance counters and have them displayed there as well. Since we have such a nice tool, why not leverage it to show the queue details and correlate them with system performance counters such as processor utilization, memory, etc.?

Pros

  • No need for a third-party library for visualization.
  • Simple to implement

Cons

  • If Microsoft discontinues Perfmon in the future, we need to find an alternative
  • Needs special permission if we are creating a new Perfmon category.

Code snippet

Without code, it's very difficult for me to end the post. So I am just sharing links about working with PerformanceCounters in .NET.


Registry permission for creating perfmon
http://bytes.com/topic/net/answers/501945-requested-registry-access-not-allowed-performance-counte
