Tuesday, April 21, 2015

Determining system health in a distributed queueing system before dequeuing the next message

We have recently started building a distributed system in which long-running operations are offloaded to processing machines. The processing machines pull messages from a queue rather than having a manager allocate tasks to them.

Initially we assigned a throttling number to each processing machine according to its configuration, such as the number of processors and the amount of RAM. But we soon saw that the machines were either underutilized or overutilized. So we decided to introduce a mechanism in which the machine's real-time load is considered before a new message is taken from the queue.
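To make the idea concrete, here is a minimal sketch of a load-gated pull in Python. It is not our actual implementation: the in-process queue.Queue stands in for the real distributed queue, and the names try_dequeue, drain, is_healthy and handle are made up for illustration. The health check is passed in as a function, so any way of sensing load can be plugged in.

```python
import queue
from typing import Callable, Optional


def try_dequeue(
    messages: "queue.Queue[str]",
    is_healthy: Callable[[], bool],
) -> Optional[str]:
    """Return the next message only if the machine reports itself healthy."""
    if not is_healthy():
        # Machine is loaded: leave the message in the queue for another worker.
        return None
    try:
        return messages.get_nowait()
    except queue.Empty:
        # Nothing to do right now.
        return None


def drain(
    messages: "queue.Queue[str]",
    is_healthy: Callable[[], bool],
    handle: Callable[[str], None],
    max_polls: int,
) -> int:
    """Poll the queue up to max_polls times; return how many messages ran."""
    processed = 0
    for _ in range(max_polls):
        message = try_dequeue(messages, is_healthy)
        if message is not None:
            handle(message)
            processed += 1
    return processed
```

A real worker would also wait between polls and re-check health on every iteration, so that a machine that becomes busy simply stops pulling until the load drops.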

We considered several techniques for sensing the load, but none seemed better than analyzing the Windows performance counters. So we decided to check the appropriate performance counters before dequeuing a new message.

What are the appropriate performance counters? It's always debatable. We are in the initial stages of the implementation and hope to post an update soon.

Another challenge was how to translate the performance counter data into a boolean value that says whether the system is healthy or not. We evaluated PAL (Performance Analysis of Logs), which reads .blg files and produces a report. In the end we decided to save the performance counter values to a database and run our own rule engine there in place of PAL.
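A rough sketch of what such a rule engine could look like is below, in Python with an in-memory SQLite database standing in for the real one. Everything in it is an assumption for illustration: the samples table, the Rule class, the averaging window and the thresholds are hypothetical, and the counter paths in the usage lines are only examples, since the right set of counters is still open.

```python
import sqlite3
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rule:
    counter: str        # counter path as stored in the samples table
    max_average: float  # hypothetical threshold for the recent average


def create_store() -> sqlite3.Connection:
    """Create an in-memory store for counter samples (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "counter TEXT NOT NULL, "
        "value REAL NOT NULL)"
    )
    return conn


def record(conn: sqlite3.Connection, counter: str, value: float) -> None:
    """Save one sampled counter value."""
    conn.execute(
        "INSERT INTO samples (counter, value) VALUES (?, ?)",
        (counter, value),
    )


def is_healthy(
    conn: sqlite3.Connection, rules: Iterable[Rule], window: int = 5
) -> bool:
    """Healthy only if every rule's recent average is within its threshold."""
    for rule in rules:
        average = conn.execute(
            "SELECT AVG(value) FROM ("
            "SELECT value FROM samples WHERE counter = ? "
            "ORDER BY id DESC LIMIT ?)",
            (rule.counter, window),
        ).fetchone()[0]
        # No samples yet is treated as unhealthy (a conservative assumption).
        if average is None or average > rule.max_average:
            return False
    return True


# Example usage with illustrative counters and thresholds.
store = create_store()
example_rules = [
    Rule(r"\Processor(_Total)\% Processor Time", 80.0),
    Rule(r"\Memory\% Committed Bytes In Use", 90.0),
]
record(store, r"\Processor(_Total)\% Processor Time", 35.0)
record(store, r"\Memory\% Committed Bytes In Use", 60.0)
ready_for_next_message = is_healthy(store, example_rules)
```

Averaging over the last few samples rather than reading only the latest value keeps a single spike from flipping the answer; how long the window should be is something that would need tuning.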

Some links on how to work with performance counters using perfmon.exe are below:

https://technet.microsoft.com/en-us/library/cc722414.aspx
http://www.windowsnetworking.com/articles-tutorials/netgeneral/Scripted-Networt-Defense-Part2.html
