Incidents

Incidents are created when a connected service sends us an alert, or when an existing monitor or heartbeat fails.

What are Incidents used for

Incidents represent time-sensitive threats to your infrastructure, such as URL downtime, a missed heartbeat, or a low-level infrastructure alert.

The current on-call person is alerted when an incident is created.

If the current on-call person doesn't acknowledge the incident within a specified time period, the entire team is alerted.

You can configure on-call escalations and escalation periods for each Monitor and Heartbeat separately.

Example

We are monitoring facebook.com every 30 seconds.

Our on-call escalation is configured to alert the entire team if the current on-call person doesn't acknowledge the incident within 3 minutes.

  • facebook.com goes down at 3:25AM

  • a new incident is created

  • Better Uptime calls the current on-call person and sends them an SMS and an e-mail

  • the current on-call person is asleep

  • they don't acknowledge the incident and continue dreaming 😴

  • after 3 minutes Better Uptime alerts (call, SMS, e-mail) the entire team
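The timeline above can be sketched in a few lines of Python. This is only an illustration of the escalation rule described here, not Better Uptime's implementation: the function name `escalation_timeline`, its parameters, and the calendar date are hypothetical, and the sketch assumes the incident is created at the moment the site goes down.

```python
from datetime import datetime, timedelta


def escalation_timeline(incident_started, escalation_period, acknowledged_at=None):
    """Return the (time, recipients) alert steps for a single incident.

    Hypothetical helper: models only the rule described in this article.
    """
    # The current on-call person is always alerted when the incident is created.
    steps = [(incident_started, "current on-call person")]

    # The entire team is alerted only if nobody acknowledged in time.
    escalate_at = incident_started + escalation_period
    if acknowledged_at is None or acknowledged_at > escalate_at:
        steps.append((escalate_at, "entire team"))
    return steps


# facebook.com goes down at 3:25 AM (the date itself is arbitrary).
started = datetime(2024, 1, 1, 3, 25)
for when, who in escalation_timeline(started, timedelta(minutes=3)):
    print(when.strftime("%H:%M"), "alert", who)
```

With a 3-minute escalation period and no acknowledgement, the second step falls at 3:28 AM. Passing an `acknowledged_at` time within the 3 minutes leaves only the first step.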

Acknowledging the incident

After you acknowledge the incident, no other team members get alerted.

You can manually acknowledge the incident in the upper-right corner of the incident detail page.

Acknowledging via a phone call

When Better Uptime calls you, you are prompted to press 1 to acknowledge the incident.

If you don't want to acknowledge the incident (for example, you are away from your computer and can't start resolving the problem right away), just hang up and other team members will get alerted.

Acknowledging via an e-mail

To acknowledge the incident, click the Acknowledge incident button in the e-mail you receive when a new incident starts.

Screenshot of a new incident email

Resolving an incident

Once the incident is acknowledged, you can resolve it.

Incidents are automatically resolved after the endpoint becomes available again.

You can manually resolve the incident by clicking Resolve in the upper-right corner of the incident detail page, or wait until it is resolved automatically.
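The acknowledge and resolve rules can be pictured as a small state machine. The sketch below is a rough model only: the `Status` and `Incident` names and their methods are hypothetical, and it assumes that automatic resolution does not require a prior acknowledgement, which this article does not state explicitly.

```python
from enum import Enum


class Status(Enum):
    STARTED = "started"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


class Incident:
    """Hypothetical model of the incident lifecycle described in this article."""

    def __init__(self):
        self.status = Status.STARTED

    def acknowledge(self):
        # Acknowledging only has an effect on a newly started incident.
        if self.status is Status.STARTED:
            self.status = Status.ACKNOWLEDGED

    def resolve_manually(self):
        # Manual resolution becomes available once the incident is acknowledged.
        if self.status is not Status.ACKNOWLEDGED:
            raise ValueError("the incident must be acknowledged before it is resolved")
        self.status = Status.RESOLVED

    def endpoint_recovered(self):
        # Assumption: automatic resolution works from any state.
        self.status = Status.RESOLVED


incident = Incident()
incident.acknowledge()
incident.resolve_manually()
print(incident.status.value)
```

Calling `resolve_manually()` on an incident that has not been acknowledged raises a `ValueError` in this sketch, mirroring the order of steps described above.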

Screenshots and Responses

We take a screenshot and save a raw response of your website every time an incident caused by downtime happens. They can be extremely useful when figuring out exactly what happened.

You can find the screenshot and the response in the headline on an incident detail page.

In some circumstances a screenshot or a response is not available. For example, when a request times out, there is typically no response to save.

Comment on incidents

You can collaborate on resolving an incident with your colleagues using comments. Upload screenshots and share your insights as you work through the problem together.

You can use Markdown in the comments.

Writing post mortems

Post mortems are short summaries of incidents.

They typically describe why an incident happened, estimated cost, and how to prevent similar incidents in the future.

The best teams write and share their post mortems after each significant incident.

To write a post mortem, comment on the incident and include the phrase "post mortem" in the comment. See the example below.

Example post mortem comment