{"_id":"56a20e302255370d00ad5ecb","version":{"_id":"56a1f77542dfda0d00046288","__v":9,"project":"56a1f77442dfda0d00046285","createdAt":"2016-01-22T09:33:41.397Z","releaseDate":"2016-01-22T09:33:41.397Z","categories":["56a1f77542dfda0d00046289","56a1fdf442dfda0d00046294","56a2079f0067c00d00a2f955","56a20bdf8b2e6f0d0018ea84","56a3e78a94ec0a0d00b39fed","56af19929d32e30d0006d2ce","5721f4e9dcfa860e005bef98","574e870be892bf0e004fde0d","5832fdcdb32d820f0072e12f"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"","version_clean":"1.0.0","version":"1.0"},"project":"56a1f77442dfda0d00046285","parentDoc":null,"__v":3,"user":"56a1f7423845200d0066d71b","category":{"_id":"56a20bdf8b2e6f0d0018ea84","pages":["56a20e302255370d00ad5ecb"],"project":"56a1f77442dfda0d00046285","__v":1,"version":"56a1f77542dfda0d00046288","sync":{"url":"","isSync":false},"reference":false,"createdAt":"2016-01-22T11:00:47.207Z","from_sync":false,"order":3,"slug":"features","title":"Features"},"updates":["57a9ddaebb7e5f2000af8588"],"next":{"pages":[],"description":""},"createdAt":"2016-01-22T11:10:40.710Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"settings":"","results":{"codes":[]},"auth":"required","params":[],"url":""},"isReference":false,"order":5,"body":"Alerting system is built on top of the metrics collected by our agents running on your infrastructure.  To learn more about what kind of metrics do we collect, check out the [metrics documentation](doc:metrics).\n\nYou simply don't have enough time watching the metrics page and trying to figure out problems with your infrastructure. \nThis feature allows you to set an alert for your future self, or your team when something bad occurs based on the collected metrics.\n\nThe alerting feature is reachable from the sidebar via the bell sign.\n\n## Setting up a new alert\n\nIf you are not sure which metrics are important to you, try using one of our presets.\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/7c818a5-Screen_Shot_2016-08-10_at_PM_4.52.52.png\",\n        \"Screen Shot 2016-08-10 at PM 4.52.52.png\",\n        1163,\n        470,\n        \"#2f556a\"\n      ]\n    }\n  ]\n}\n[/block]\n\n[block:callout]\n{\n  \"type\": \"warning\",\n  \"body\": \"Our services check your metrics every 3 minutes. That is the most time that you'll have to wait to get notified, if something goes wrong, you'll have to take that into account when creating time sensitive alerting, if you set one for less than 3 minutes, it might not be 100% accurate.\"\n}\n[/block]\nThere are 4 kinds of conditions that you can set an alert for:\n- response time\n- error rate by status code\n- service downtime\n- memory\n\n### Response time\n\nResponse time is a valuable metric when you're dealing with a web service, if your responses are slow that can cause serious problems for your users. You can get notified about your slow responses.\n\nSetting up a response time alert requires the threshold in milliseconds.\n\n### Error rate by status code\n\nWhen setting up error rate by status code you have to set the status code threshold that you'd like to alert for, the recommended setup here is 500 - 600 because all the server related errors are 5XX. 
If those happen that means something is wrong in your service's code.\n\nError rate means that we take all your status code reports from your services, and see what is the rate of the bad status codes.\n\n### Service downtime\n\nThis is the most important alert to set up. Downtime means that your service was unable to report to the Trace by RisingStack servers. It could mean several things, maybe your app is unable to reach the internet, or it has crashed, or something went wrong during deployment and it's not working. These issues are critical so you probably want to be notified as soon as possible.\n\n You can set up downtime in minutes, note here that the minimum you should be setting this value is >3 minutes, otherwise it could be non-accurate.\n\n### Memory\n\nIf you know how much memory you have available you can set up alerting to scale up your machines, before they run out of resources. You can set the thresholds in megabytes. \n\n## Setting up alert channels\n\nWhen you're setting up alerting feature, keep in mind that you have a warning and a critical level of alert. You can use it to set up different channels for different type of alerts, for example you don't want to wake up in the middle of the night when the alert is at warning level, but you'd like to get an email about it. You can set up different alert levels for different types of channels.\n\nWhen an alert happens, you want to get notified somehow, by default Trace by RisingStack provides an integrated incident UI, but if you'd like to integrate it with your existing tools, you can do that the following way:\n\nThese are the integrations available in trace:\n- [Slack](http://slack.com)\n- [Pagerduty](http://pagerduty.com)\n- [OpsGenie](http://opsgenie.com)\n- Webhooks\n- Email\n\n### Slack\n\nFor Slack you have to go to your team's slack apps page and look for incoming webhooks.\nhttps://yourteam.slack.com/apps/A0F7XDUAZ-incoming-webhooks\nOn the integration settings you can set up which channel you'd like to post to and other useful options, but the thing we need is the webhook URL. Copy that and insert it into the field on the Trace by RisingStack UI.\n\n### OpsGenie\n\nThe OpsGenie UI allows you to set up API integrations, on the sidebar select integrations and look for a cogwheel icon for the API integration, click on that and copy over the token from the Api-key field, into the Trace by RisingStack UI.\n\n### Email\n\nYou can set up an automated email notification from the Trace by RisingStack UI, the only thing you need to do is to enter an email address that you'd like to send to.\n[block:callout]\n{\n  \"type\": \"warning\",\n  \"body\": \"You can only send notifications to a single email\"\n}\n[/block]\n### Webhook\n\nYou can make Trace by RisingStack call an endpoint via HTTP. Just give it a URL and it's going to POST with a `Content-type: application/json` header. 
The payload will look like this: \n\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"{\\n\\t\\\"value\\\": 411,\\n  \\\"reason\\\": {\\n    \\\"type\\\": \\\"warningToError\\\",\\n    \\\"description: \\\"warning to error\\\"\\n  },\\n  \\\"alert\\\": {\\n    \\\"name\\\": \\\"Memory alert\\\",\\n    id: 23\\n  },\\n  \\\"infrastructure\\\": {\\n    \\\"name\\\": \\\"My infrastructure\\\",\\n    \\\"id\\\": \\\"aaab332abb1a11bbab\\\"\\n  },\\n  \\\"service\\\": {\\n    \\\"name\\\": \\\"My service\\\",\\n    \\\"id\\\": \\\"aaab3323aa1a11bbab\\\"\\n  }\\n}\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\n## Actions\n\nYou can set up actions that can be done when an alert gets triggered.\n\n### Profiling actions\n\nIn Trace by RisingStack you have the ability to create memory heapdumps and CPU profiles, this feature is integrated into the alerting feature too. If you add it to your alert, reaching a specific threshold can automatically dump your memory so you can inspect it later!\nClick on the warning or error checkbox to add the action to your alert. When it is triggered you'll be able to download and inspect on the profiling UI, that can be found via the stopwatch sign on the sidebar.\n[block:callout]\n{\n  \"type\": \"danger\",\n  \"body\": \"Be careful, creating a memory heapdump in a running program is a very expensive operation, and it requires additional memory.\"\n}\n[/block]\n## Setting up useful alerts\n\n### Downtime\n\nWith alerts you need to maintain a proper balance between keeping your services running and not being woken up during the night. \n\nIn case of crucial web processes you might want to keep the downtime as low as possible in order to prevent losing customers. This means that you might want to set warning levels to 3-5 minutes and critical to 6-10 minutes. \n\nOn the other hand some problems such as temporary connectivity issues get solved on their own by time. So eg. if you have workers that communicate through a messaging queue or by writing to database then polling it for new data, you might not want to be woken up every time a query timeouts or connection to the queue is lost. So in these cases you might safely get away with setting downtime warning level to 15 minutes and critical to 30 minutes. This way you can make sure to be alerted when things are actually going downhill, yet you can sleep through if network access is patchy but satisfactory.\n\nObviously, your mileage may differ so use these values as starting points then experiment with the settings until you figure out which works the best for which service in your case.","excerpt":"Never miss an important event again!","slug":"alerting","type":"basic","title":"Alerting"}

# Alerting

Never miss an important event again!

The alerting system is built on top of the metrics collected by our agents running on your infrastructure. To learn more about what kind of metrics we collect, check out the [metrics documentation](doc:metrics).

You simply don't have enough time to watch the metrics page all day and figure out problems with your infrastructure by hand. Alerting lets you notify your future self, or your team, when something bad happens based on the collected metrics.

The alerting feature is reachable from the sidebar via the bell icon.

## Setting up a new alert

If you are not sure which metrics are important to you, try one of our presets.

[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/7c818a5-Screen_Shot_2016-08-10_at_PM_4.52.52.png",
        "Screen Shot 2016-08-10 at PM 4.52.52.png",
        1163,
        470,
        "#2f556a"
      ]
    }
  ]
}
[/block]

[block:callout]
{
  "type": "warning",
  "body": "Our services check your metrics every 3 minutes, so that is the longest you may have to wait to get notified when something goes wrong. Keep this in mind when creating time-sensitive alerts: anything set below 3 minutes might not be 100% accurate."
}
[/block]

There are four kinds of conditions you can set an alert for:
- response time
- error rate by status code
- service downtime
- memory

### Response time

Response time is a valuable metric when you're running a web service: slow responses can cause serious problems for your users, so you probably want to be notified about them. Setting up a response time alert only requires a threshold in milliseconds.

### Error rate by status code

When setting up an error rate alert you have to define the status code range you'd like to be alerted on. The recommended setup is 500 - 600, because all server-related errors are 5XX; if those show up, something is wrong in your service's code.

Error rate means that we take all the status codes reported by your services and calculate what portion of them falls into the bad range.
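For illustration only, here is a minimal sketch of how such a rate could be computed from one batch of reported status codes. The sample data, the bounds, and the percentage formatting are assumptions for the example, not the exact calculation Trace performs.

```typescript
// Hypothetical status codes reported by a service during one check interval.
const reportedStatusCodes = [200, 201, 503, 200, 500, 404, 200, 502];

// The alert's status code range, e.g. 500 (inclusive) to 600 (exclusive).
const lowerBound = 500;
const upperBound = 600;

const isBad = (code: number): boolean => code >= lowerBound && code < upperBound;

const badCount = reportedStatusCodes.filter(isBad).length;
const errorRate = (badCount / reportedStatusCodes.length) * 100;

console.log(`Error rate: ${errorRate.toFixed(1)}%`); // Error rate: 37.5%
```

If the resulting rate crosses the threshold you configured for the alert, you get notified.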
### Service downtime

This is the most important alert to set up. Downtime means that your service was unable to report to the Trace by RisingStack servers. It can mean several things: maybe your app cannot reach the internet, it has crashed, or something went wrong during a deployment and it's simply not running. These issues are critical, so you probably want to be notified as soon as possible.

You can set the downtime threshold in minutes. Note that you should set this value to more than 3 minutes, otherwise the alert could be inaccurate.

### Memory

If you know how much memory you have available, you can set up an alert to scale up your machines before they run out of resources. You can set the thresholds in megabytes.

## Setting up alert channels

When you're setting up the alerting feature, keep in mind that every alert has a warning and a critical level. You can use them to route different levels to different channels: for example, you may not want to be woken up in the middle of the night by a warning-level alert, but you'd still like to get an email about it.

When an alert fires, you want to get notified somehow. By default, Trace by RisingStack provides an integrated incident UI, but if you'd like to plug alerting into your existing tools, the following integrations are available:
- [Slack](http://slack.com)
- [PagerDuty](http://pagerduty.com)
- [OpsGenie](http://opsgenie.com)
- Webhooks
- Email

### Slack

For Slack, go to your team's Slack apps page and look for Incoming WebHooks:
https://yourteam.slack.com/apps/A0F7XDUAZ-incoming-webhooks
On the integration settings page you can choose which channel to post to, along with other useful options, but the thing we need is the webhook URL. Copy it and paste it into the corresponding field on the Trace by RisingStack UI.

### OpsGenie

The OpsGenie UI allows you to set up API integrations. Select Integrations on the sidebar, look for the cogwheel icon of the API integration, click on it, and copy the token from the Api-key field into the Trace by RisingStack UI.

### Email

You can set up an automated email notification from the Trace by RisingStack UI; the only thing you need to do is enter the email address you'd like the notifications sent to.

[block:callout]
{
  "type": "warning",
  "body": "You can only send notifications to a single email address."
}
[/block]

### Webhook

You can make Trace by RisingStack call an endpoint via HTTP. Just give it a URL and it will send a POST request with a `Content-Type: application/json` header. The payload will look like this:

[block:code]
{
  "codes": [
    {
      "code": "{\n  \"value\": 411,\n  \"reason\": {\n    \"type\": \"warningToError\",\n    \"description\": \"warning to error\"\n  },\n  \"alert\": {\n    \"name\": \"Memory alert\",\n    \"id\": 23\n  },\n  \"infrastructure\": {\n    \"name\": \"My infrastructure\",\n    \"id\": \"aaab332abb1a11bbab\"\n  },\n  \"service\": {\n    \"name\": \"My service\",\n    \"id\": \"aaab3323aa1a11bbab\"\n  }\n}",
      "language": "json"
    }
  ]
}
[/block]
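If you need a starting point for the receiving side, here is a minimal sketch of an endpoint that could accept this POST, using only Node's built-in `http` module. The port and the logged fields are assumptions for the example; all Trace needs is a URL that accepts the JSON payload shown above.

```typescript
import * as http from "http";

// Minimal webhook receiver sketch: parses the JSON body sent by the alerting
// webhook and logs the fields shown in the example payload above.
const server = http.createServer((req, res) => {
  if (req.method !== "POST") {
    res.statusCode = 405;
    res.end();
    return;
  }

  let body = "";
  req.on("data", (chunk) => { body += chunk; });
  req.on("end", () => {
    try {
      const payload = JSON.parse(body);
      console.log(
        `Alert "${payload.alert.name}" on service "${payload.service.name}":`,
        `${payload.reason.description} (value: ${payload.value})`
      );
      res.statusCode = 200;
    } catch (err) {
      // Ignore anything that is not valid JSON.
      res.statusCode = 400;
    }
    res.end();
  });
});

// The port is an assumption; expose whatever URL you configure on the Trace UI.
server.listen(3000);
```

Any HTTP framework works just as well; the only contract is the JSON payload shown above.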
## Actions

You can set up actions to be performed when an alert gets triggered.

### Profiling actions

Trace by RisingStack gives you the ability to create memory heapdumps and CPU profiles, and this feature is integrated into alerting as well. If you add it to your alert, reaching a specific threshold can automatically dump your memory so you can inspect it later!

Click the warning or error checkbox to add the action to your alert. When it is triggered, you'll be able to download and inspect the result on the profiling UI, which can be found via the stopwatch icon on the sidebar.

[block:callout]
{
  "type": "danger",
  "body": "Be careful: creating a memory heapdump in a running program is a very expensive operation, and it requires additional memory."
}
[/block]

## Setting up useful alerts

### Downtime

With alerts you need to maintain a proper balance between keeping your services running and not being woken up during the night.

For crucial web processes you might want to keep the downtime thresholds as low as possible in order to prevent losing customers. This means setting the warning level to 3-5 minutes and the critical level to 6-10 minutes.

On the other hand, some problems, such as temporary connectivity issues, get solved on their own over time. For example, if you have workers that communicate through a messaging queue, or by writing to a database and then polling it for new data, you might not want to be woken up every time a query times out or the connection to the queue is lost. In these cases you can safely get away with setting the downtime warning level to 15 minutes and the critical level to 30 minutes. This way you can make sure to be alerted when things are actually going downhill, yet sleep through the night if network access is patchy but still acceptable.

Obviously, your mileage may vary, so use these values as starting points and experiment with the settings until you figure out what works best for each of your services.