Manage incidents

When a monitor detects a variation that's out of bounds, it logs an incident. Each incident is specific to one monitor and has a start time, a severity, a direction, a duration, and a status.
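To make these attributes concrete, here is a minimal sketch of how an incident could be modeled. It is an illustration only, not Lightup's schema or API: the class name, the field names, the `end_time` field, and the monitor name are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical in-memory model of an incident (not Lightup's schema)."""
    monitor: str                 # each incident is specific to one monitor
    start_time: datetime         # first unexpected metric value
    severity: str                # for example "High" or "Medium"
    direction: str               # "above", "below", or "both"
    status: str = "Unviewed"     # the default starting status
    end_time: Optional[datetime] = None  # assumed field; None while ongoing

    def duration(self, now: datetime) -> timedelta:
        """Time from start until recovery, or until `now` if still ongoing."""
        return (self.end_time or now) - self.start_time


incident = Incident(
    monitor="row-count-monitor",  # hypothetical monitor name
    start_time=datetime(2024, 1, 1, 8, 0),
    severity="High",
    direction="below",
)
print(incident.duration(now=datetime(2024, 1, 1, 11, 30)))  # 3:30:00
```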

You have several options for reviewing incidents:

  • The Incidents list is a central place to work with all the incidents in the workspace.
  • The Incident dashboard provides summary information about a workspace's incidents, and links to the Incidents list so you can drill down.
  • The Data Quality dashboard summary scores indicate incidents indirectly: each monitor that logs at least one incident is excluded from the summary score for its dimension. For example, an Accuracy score of 10/11 (91%) means that one of the 11 monitors attached to metrics that measure Accuracy logged one or more incidents. For more detail, select the summary score and then look at the heatmap, which shows two weeks of history for each monitor included in the summary score. Each day on which an incident occurred appears as a red box on the heatmap.
  • The Explorer tree lets you review any incident by opening the related metric's chart: when you select a data asset that has recent incidents, an Incidents menu appears on every metric chart where a monitor logged an incident.
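The summary score arithmetic described for the Data Quality dashboard can be sketched as follows. This is an illustration under stated assumptions, not Lightup's implementation: the function name, the input shape, and the rounding to a whole percent are all hypothetical.

```python
def summary_score(monitor_incident_counts):
    """Return (passing, total, percent) for one data quality dimension.

    `monitor_incident_counts` maps a monitor name to the number of incidents
    it logged. A monitor with at least one incident does not count toward
    the score.
    """
    total = len(monitor_incident_counts)
    passing = sum(1 for count in monitor_incident_counts.values() if count == 0)
    # Assumption: the percentage is rounded to the nearest whole number.
    percent = round(100 * passing / total) if total else 0
    return passing, total, percent


# Eleven hypothetical Accuracy monitors, one of which logged incidents.
accuracy_monitors = {f"monitor-{i}": 0 for i in range(1, 11)}
accuracy_monitors["monitor-11"] = 3
print(summary_score(accuracy_monitors))  # (10, 11, 91)
```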

The Incidents list

With a workspace selected, select Incidents on the top bar to open the Incidents list.

  • Assess incidents at a glance: totals across the top display summary information about the incidents in the workspace. These numbers reflect any filters you set and are limited to the date range you specify.
  • Just above the list items, controls let you work with the list:
    • Filter by Direction, Severity, Status, or Validation Status. Use Apply Filter and Clear all to set and unset the current list filters.
    • Search for an incident.
    • Adjust the date range of incidents to show.
    • Use Show x to set how many rows to display per page (in general, fewer is faster).
    • Select the gear icon to choose what columns to display.
    • Navigation arrows and page numbers let you jump around in the list.
  • In the list header, you have more options for working with list items:
    • Select the check box to select all items on the page.
    • Select a column heading to sort the list using the column's values in ascending order. Select it again to reverse the sort order. Select it a third time to stop sorting.
  • In the ID column, select a value to open the incident details.
  • In the Metric column, select a value to open that metric in the Explorer tree.
  • The Monitor column reflects the monitor that detected unexpected metric values.
  • Start Time shows the timestamp associated with the first unexpected metric value for the incident.
  • Duration is the amount of time elapsed from Start Time until the metric values returned to the expected range: how long the incident lasted (or has lasted so far, if it is ongoing). Duration is affected by the monitor configuration settings drift duration and recovery duration.
  • Direction shows whether the unexpected metric values were above or below expectations, or both.
  • Status indicates whether and how the incident has been handled. Most statuses are manually set, but Lightup automatically changes Unviewed (the default starting status) to Viewed as soon as anyone opens the incident details page.
  • Severity reflects the degree to which the metric's values deviated from expectations. Internally, Lightup assesses severity on an inverse 1-10 scale, where 1 is most severe: values below 4 are labeled High, and values of 4 and above are labeled Medium.
  • The remaining columns reflect the data asset that produced out-of-bounds metric values. Depending on the kind of data asset the metric measures (schema, table, column), some of these columns may be empty.
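The severity labeling rule in the list above can be expressed as a small function. This is a sketch of the stated thresholds only; the function name is hypothetical, and how Lightup computes the internal score is not covered here.

```python
def severity_label(score: float) -> str:
    """Map an internal severity score (inverse 1-10 scale) to its label.

    1 is the most severe value. Scores below 4 are labeled High;
    scores of 4 and above are labeled Medium.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"severity score must be between 1 and 10, got {score}")
    return "High" if score < 4 else "Medium"


for score in (1, 3.9, 4, 10):
    print(score, severity_label(score))
```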

The Incident dashboard

For an overview of a workspace's incidents, select Incident on the Dashboards tab.

  • On the left, tiles indicate the number of incidents with various statuses. If the number isn't zero, the tile is also a filtered link to the Incidents list. Select a tile to open the Incidents list and display items with the indicated status. Note that Resolved is actually a validation status, so the filters have some overlap (e.g., an incident can be Submitted and Resolved at the same time).
  • On the right, you can specify a date range at the top, and review the following charts:
    • Incidents plots the count of incidents over time.
    • Incidents by incident status also shows counts over time, but as stacked bars (one per incident status). Select a status in the legend at the bottom of the chart to toggle the corresponding bars on or off. For example, if all statuses are displayed and you want to hide unviewed incidents, select Unviewed in the legend. Select it again to show them.
    • Resolved incidents plots the count of resolved incidents over time. For information about resolving incidents, see Validate an incident fix.
    • Incidents by severity shows counts of incidents over time as stacked bars (one per severity). Select a severity in the legend at the bottom of the chart to toggle the corresponding bars on or off.

Validate an incident fix

After you review an incident and take action to correct the data quality issue, you can validate that your fix addresses the incident, that is, that the data quality issue is resolved.

🚧

Validation is not supported for real-time metrics, that is, metrics where the x-axis of the metric chart shows when the value was calculated. Currently, this includes:

  • Activity metrics for schemas and tables
  • Data delay metrics

  1. On the incident details page, select Run validation. Lightup begins checking to see whether the data quality incident is fixed, and the button text changes to Running validation....
  2. To cancel, select ... → Cancel validation.

After validation finishes running, the button text changes again to reflect the outcome:

  • Resolved if the issue is fixed
  • Unresolved if the issue is still present
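The button text changes described in this section can be pictured as a small state machine. The sketch below is only a mental model, not Lightup code: the names are hypothetical, and it assumes that cancelling returns the button to Run validation, which this page does not state.

```python
from enum import Enum


class ValidationButton(Enum):
    """Button text on the incident details page at each stage."""
    RUN = "Run validation"
    RUNNING = "Running validation..."
    RESOLVED = "Resolved"
    UNRESOLVED = "Unresolved"


def start(state: ValidationButton) -> ValidationButton:
    """Selecting Run validation starts the check."""
    if state is not ValidationButton.RUN:
        raise ValueError(f"cannot start validation from {state.name}")
    return ValidationButton.RUNNING


def cancel(state: ValidationButton) -> ValidationButton:
    """Assumption: cancelling returns the button to its initial text."""
    if state is not ValidationButton.RUNNING:
        raise ValueError(f"cannot cancel validation from {state.name}")
    return ValidationButton.RUN


def finish(state: ValidationButton, issue_fixed: bool) -> ValidationButton:
    """When validation finishes, the button reflects the outcome."""
    if state is not ValidationButton.RUNNING:
        raise ValueError(f"cannot finish validation from {state.name}")
    return ValidationButton.RESOLVED if issue_fixed else ValidationButton.UNRESOLVED


state = start(ValidationButton.RUN)
state = finish(state, issue_fixed=True)
print(state.value)  # Resolved
```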

👍

If validation indicates the incident is resolved, consider changing the incident status to Closed.