- Why another metrics library?
- Moving Averages
- Histograms and Percentiles
- Application Modes - Normal, Heavy, Unresponsive
- Agent enhancement
Why another metrics library?
When I began looking for a Java metrics library to adopt I ended up using and forking Coda Hale Metrics, which is a very successful metrics library. For timed events Coda Hale Metrics emphasises histograms/percentiles and moving averages. Avaje Metrics is oriented towards reporting metrics every minute or better, and in that scenario I developed a significant preference for simpler and cheaper counters (count, total, average, max) as well as some other features - separate error statistics and BucketTimedMetric.
When collecting and reporting metrics every minute I found the exponentially weighted moving averages to be quite laggy. For interpreting minute by minute activity the simpler counters (count, total, mean, max) provided a much more accurate representation of the behaviour.
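As a sketch of how cheap these counters are, the class below captures count, total, mean and max for one collection interval using JDK 8's LongAdder and LongAccumulator (the class and method names are illustrative, not the library's actual API; the library itself targeted Java 7 via jsr166e backports of these classes):

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch (not the library's actual API): cheap counters
// that capture count, total, mean and max for one collection interval.
class SimpleTimedCounter {

  private final LongAdder count = new LongAdder();
  private final LongAdder total = new LongAdder();
  private final LongAccumulator max = new LongAccumulator(Long::max, 0L);

  /** Record one timed event, in microseconds. */
  void add(long micros) {
    count.increment();
    total.add(micros);
    max.accumulate(micros);
  }

  /** Snapshot and reset the statistics, typically once a minute. */
  String collect() {
    long n = count.sumThenReset();
    long tot = total.sumThenReset();
    long mx = max.getThenReset();
    long mean = (n == 0) ? 0 : tot / n;
    return "count=" + n + " total=" + tot + " mean=" + mean + " max=" + mx;
  }
}
```

The memory footprint per metric is a handful of longs, and recording an event is a couple of lock-free adds - which is what makes per-minute collection cheap enough to leave on all the time.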
Histograms and Percentiles
Percentiles are definitely the way to go if the period over which you are monitoring is relatively long. Said another way, if the period you are collecting metrics for is 1 hour then simple counters like mean and max can be next to useless - the server could have spent 55 minutes responding normally in milliseconds, then some resource gets maxed out and for 5 minutes the response times go off the charts. A single mean and max covering the 1 hour period is not going to provide much insight or help. You need to collect frequently enough that you can discern the various 'modes' of the application (normal/good, heavy load, terminal/unresponsive/dead) and that the statistics for these modes remain relatively separate and distinct.
Histograms typically hold a sample of the collected values and in this sense they require more memory. In addition they have more overhead compared to LongAdder and LongMaxUpdater.
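To make the memory trade-off concrete, here is a minimal reservoir-style sample of the kind a histogram keeps in order to answer percentile queries (names and the sample size are illustrative, not taken from any particular library):

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch: a fixed-size random ("reservoir") sample of
// response times from which percentiles are derived. Note the memory
// cost: 1024 longs per metric, versus a few longs for simple counters.
class SampleHistogram {

  private final long[] values = new long[1024];
  private final Random random = new Random();
  private long count;

  void add(long value) {
    if (count < values.length) {
      values[(int) count] = value;
    } else {
      // Reservoir sampling (Vitter's Algorithm R): replace a random
      // slot so every value seen has an equal chance of being kept.
      long r = (long) (random.nextDouble() * (count + 1));
      if (r < values.length) {
        values[(int) r] = value;
      }
    }
    count++;
  }

  /** Value at the given quantile, e.g. 0.95 for the 95th percentile. */
  long percentile(double quantile) {
    int n = (int) Math.min(count, values.length);
    long[] copy = Arrays.copyOf(values, n);
    Arrays.sort(copy);
    int index = (int) Math.ceil(quantile * n) - 1;
    return copy[Math.max(index, 0)];
  }
}
```

Every recorded value touches the sample array, and each percentile query copies and sorts it - this is the extra memory and overhead the paragraph above refers to.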
Application Modes - Normal load, Heavy load, Unresponsive
Generalisations about applications are dangerous, but frequently applications have different 'modes' based on the load they are handling.
- Good / Normal Load - Server responding as expected under light or moderate load. No resources have hit their limit.
- Heavy Load - One of the resources has hit a limit, response times take a jump.
- Unresponsive - Multiple resources hitting limits (perhaps feeding back on each other); response times make large jumps and tend towards unresponsive as load increases.
If your application has 'modes' then response times do not increase in a linear fashion but instead 'jump' - there are inflection points at which the 'mode' changes and response times jump significantly. If the collected statistics cover multiple 'modes' then a single mean and max spanning those modes is relatively unhelpful - perhaps even useless.
Histograms excel when the statistics collected cover multiple 'modes' (where max and mean are not as useful). If you collect frequently enough, the issue of statistics spanning multiple 'modes' is reduced and you can use the simpler and cheaper counters (count, total, mean, max).
One aspect of BucketTimedMetric is that it provides a simple way to separate the statistics collected for 'normal' mode response times from the other modes. Separating those statistics provides a clearer view than a single mean and max will. BucketTimedMetric has other uses such as clear monitoring of SLA requirements and monitoring a method that has multiple execution paths (a fast path with a cache hit, a slower path with a data fetch).
BucketTimedMetric also has the nice property that it can be aggregated accurately.
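The bucket idea can be sketched as follows (this is illustrative, not the library's actual API): timings are partitioned by fixed boundaries and each range keeps its own counters, so 'normal' mode statistics are not polluted by slow outliers. Because each bucket holds plain counts and totals, the statistics from many servers can simply be summed for an exact aggregate - something that is not possible with per-server percentiles.

```java
// Illustrative sketch of the bucket idea (not the library's API):
// boundaries of {100, 500} give three ranges: <100ms, 100-500ms, >500ms,
// each with its own count and total.
class BucketTimedSketch {

  private final int[] boundsMillis;
  private final long[] count;
  private final long[] totalMillis;

  BucketTimedSketch(int... boundsMillis) {
    this.boundsMillis = boundsMillis;
    this.count = new long[boundsMillis.length + 1];
    this.totalMillis = new long[boundsMillis.length + 1];
  }

  /** Record a timed event against the bucket its duration falls into. */
  void add(long millis) {
    int i = 0;
    while (i < boundsMillis.length && millis >= boundsMillis[i]) {
      i++;
    }
    count[i]++;
    totalMillis[i] += millis;
  }

  long count(int bucket) {
    return count[bucket];
  }

  long mean(int bucket) {
    return count[bucket] == 0 ? 0 : totalMillis[bucket] / count[bucket];
  }
}
```

With boundaries chosen around an SLA threshold, the count in the last bucket directly answers "how many requests breached the SLA" - no percentile estimation required.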
Agent enhancement
Agent enhancement provides the ability to instrument an existing application without requiring code changes. With the developer coding effort removed (at least for TimedMetric and BucketTimedMetric) you can get good application performance metrics at low cost to the project.
No excuses - any Java 7+ JVM application can have performance metrics collection.
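Conceptually, the bytecode the agent weaves around a method body is roughly equivalent to the hand-written timing code below (the class, method and counter names are illustrative only, not the agent's actual generated code) - note how successful and failing executions are recorded against separate counters, giving the separate error statistics mentioned earlier:

```java
import java.util.concurrent.atomic.LongAdder;

// Roughly the shape of the timing code a bytecode-enhancing agent
// weaves around a method body (illustrative only, not the agent's
// actual generated code).
class OrderService {

  static final LongAdder successCount = new LongAdder();
  static final LongAdder errorCount = new LongAdder();

  void process(boolean fail) {
    long start = System.nanoTime();
    try {
      // ... original method body ...
      if (fail) throw new IllegalStateException("boom");
      record(successCount, start);
    } catch (RuntimeException e) {
      record(errorCount, start);
      throw e;
    }
  }

  static void record(LongAdder counter, long startNanos) {
    counter.increment();
    // a real metric would also add (System.nanoTime() - startNanos)
    // to a total and update a max, as in the counter sketch earlier
  }
}
```

Because this wrapping happens at class-load time, existing code ships unmodified - which is the "no code changes" claim above.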