Google Cloud Platform Blog
Product updates, customer stories, and tips and tricks on Google Cloud Platform
Alerting on a metric’s rate of change
October 14, 2015
Part of your daily routine managing your cloud services is making sure bad things don’t happen, and when they do, you want to be notified as soon as possible. For that, Google Cloud Monitoring provides you with powerful alerting capabilities that notify you whenever conditions you’ve defined are triggered. For example, you can configure Cloud Monitoring to notify you when CPU utilization crosses a certain threshold, or when your application’s serving latency is too high.
But what if you don’t know exactly which thresholds to use, or what “too high” really means? What if instead, you just want to notify when things change from their norm?
For that, we’re introducing the ability to alert on a metric’s rate of change. Instead of choosing a threshold for a certain metric (e.g., “notify me when it crosses 20”), you can now specify a percentage change instead (e.g., “notify me when it changes by more than 30% compared to its previous values”). This new alert condition type is especially useful for “stable” metrics that are expected to remain within a certain range and not change too abruptly for extended periods of time, such as latency, QPS, and error ratios.
Understanding how metric rate of change conditions are evaluated
Suppose you want to be alerted when disk usage increases by over 30% in a time period of 15 minutes or less (we cover how to configure this in the next section). Here’s how this condition is evaluated:
1. Every minute, at time T, we compute a simple 10-minute moving average for the metric in question (in this case, disk usage) at times T and T minus 15 minutes.
2. We then check the difference between the two averages; if it exceeds the specified percentage, the alert condition is triggered.
Note that the 15-minute period is used as an example and is configurable. The 10-minute moving average, however, is fixed; it smooths the metric to prevent false positives. Without this smoothing, a quick spike in the metric would trigger the alert condition even if it were very short-lived and resolved itself instantly.
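The evaluation steps above can be sketched in a few lines of code. This is a minimal illustration of the logic, not Google’s actual implementation; it assumes one metric sample per minute, and the function and parameter names are ours:

```python
def moving_average(samples, end_minute, window=10):
    """Simple moving average over `window` minutes ending at end_minute."""
    points = samples[max(0, end_minute - window + 1):end_minute + 1]
    return sum(points) / len(points)

def rate_of_change_triggered(samples, t, lookback=15, threshold_pct=30.0):
    """True if the smoothed metric rose by more than threshold_pct
    between minute (t - lookback) and minute t."""
    if t - lookback < 0:
        return False  # not enough history to look back yet
    now = moving_average(samples, t)
    then = moving_average(samples, t - lookback)
    if then == 0:
        return False  # avoid division by zero on an empty baseline
    change_pct = (now - then) / then * 100.0
    # Only an increase triggers here; a decrease would be a separately
    # configured condition, as described above.
    return change_pct > threshold_pct
```

For example, with a metric that sits flat at 40 for 20 minutes and then jumps to 80, the smoothed 15-minute change quickly exceeds 30% and the condition fires, while a perfectly flat series never triggers it.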
You can configure the increase and decrease of a metric separately, so in the above example a drop of 30% or more would NOT trigger this alert condition.
Let’s look at the visual example above, using the same condition previously described. As you can see, disk usage starts rising a little after 9:30 (at the gray vertical line), but the alert condition isn’t triggered yet, since usage has only increased by 20%. Around 9:45 it increases some more and triggers the condition (the red horizontal line at the top of the graph).
The condition auto-resolves at around 10:10, since looking back 15 minutes no longer yields an increase of 30% or greater; in other words, the metric has been stable for 15 minutes. The condition is triggered again a little after 10:30, when disk usage increases from about 40% to over 80%.
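To make that last trigger concrete: the condition compares the relative change of the metric, not the absolute number of percentage points. Using the approximate values read off the graph (our numbers, for illustration):

```python
# Disk usage before and after the second jump, in percent.
old, new = 40.0, 80.0
# Relative change is measured against the earlier value.
relative_change = (new - old) / old * 100.0
print(relative_change)  # 100.0 -- well above the 30% trigger threshold
```

So a move from 40% to 80% disk usage is a 100% relative increase, even though it is "only" 40 percentage points.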
Configuring metric rate of change alerting conditions
To create a rate of change alerting condition, simply choose “Metric Rate of Change” from the “Add Condition” menu when creating or editing an alerting policy. Like other condition types, Metric Rate of Change can be combined with other conditions to create a multi-condition alerting policy.
When configuring this condition, you can choose the metric, whether to trigger on an increase or decrease, the percent of increase or decrease that would trigger the condition, and the time window to examine for change. The screenshot below shows the condition we examined in the previous example.
Note: you should take into account metric changes related to time of day, to avoid false notifications. For example, if average QPS increases tenfold every morning, this might unnecessarily trigger alert conditions that check for increased QPS activity.
Conclusion
Alerting on a metric’s rate of change gives you more flexibility when monitoring dynamic systems, without the need to explicitly define and maintain static thresholds. These conditions adapt to the way your system behaves, as part of our ongoing effort to provide you with a monitoring system that requires less maintenance, so you have more time to innovate. Give it a try: look for a stable metric that should remain within a given range, and configure a percentage increase or decrease that would signal a change worth investigating. You can read more about setting up alerts here.
Please send us an email and let us know what you think. Your feedback helps us improve.
- Posted by Amir Hermelin, Product Manager