Profile your data

To better understand your data and help enhance your data quality analysis, enable data profiling. Data profiling provides a static analysis of your tables that you can view in the Explorer tree.

You enable data profiling when you add data assets to Explorer. The initial data profile uses up to one million rows to create each table's data profile, as follows:

  • For a table with under a million rows, the data profile covers the entire table.
  • If Lightup can determine which timestamp column to use for time-range queries and the table has at least a million rows, the data profile covers the latest 30 days worth of data.
  • If Lightup can't tell which timestamp column to use for time-range queries and the table has at least a million rows, the data profile covers one million randomly-selected rows.

If you aren't sure what types of data quality checks you want, we recommend you enable data profiling to learn about your data. Reviewing these data profiles will give you the info you need to proceed with your data quality journey.

Enable data profiling on the Data Profile tab

  1. Select a table in the Explorer tree.
  2. In the main window, on the Data Profile tab, select Enable data profile.

Enable data profiling when you add assets

After you add your datasources to the Explorer tree, you then add their data assets, and can enable data profiling in the process.

  • Add the schemas that contain the tables you want to profile.
  • Add the tables that you want to profile. For each table, enable profiling by setting its Profile toggle in the Manage tables modal. Lightup then starts profiling the table in the background. When you select a profiled table in the Explorer tree, you see the data profile for the table. (Note that if you also enable autometrics for the table, their charts will display instead of the data profile.)
  • Add columns from your profiled tables to your Explorer tree. You can select a column in the Explorer tree to see its data profile. Note that you don't have to add columns in order for them to be profiled, just to make them accessible in the Explorer tree. All columns are included in the data profile.

Columns detected during profiling

Lightup uses pattern detection to determine the contents of table columns. In addition to timestamps, Lightup detects UUIDs, phone numbers, US ZIP codes, SSNs, and URLs. You can hover over the label Patterns Detected on a column profile for a quick reminder.

Review data profiles

After data profiles are generated, you can review them to help you plan your data quality analysis. For example, reviewing a table's profile will clarify which columns can be used for timestamps in metrics.

When you select a table in the Explorer tree, two tabs display details about the table: Data Profile and Auto Metrics. The Data Profile tab displays the table's data profile:

Each data profile is laid out in sections: Data profile, Table Schema, Timestamp Analysis, Sample Data, and profiled columns (section not labeled).

Data profile

The Data profile provides a list of basic facts about the table, as well as when the profile was generated.

You can also generate a new profile, replacing the current one. Only one data profile can exist for a table— there's no way to save a copy of or export a data profile.

Generate a new profile

  1. In the profile, next to Data profile select Configure. A subform appears just above the table summary.
  2. Select a Mode:
    • Full Table - Use the entire table to create the data profile.
    • Random Sample - Use one million randomly-selected rows to create the data profile.
    • Time Range - Specify a period of time and a timestamp column, then use all rows having timestamps within that period to create the data profile.
  3. Select Update to generate the new profile.

Table Schema

The Table Schema section displays a list of the table's columns. For each column, it lists the column name, data type, and any detected metadata (currently, is_Indexed and is_PartitionKey are detected).

Timestamp Analysis

The Timestamp Analysis displays a chart showing the number of rows (y-axis) associated with the timestamp value on the x-axis. You can hover over a timestamp value on the chart to see more details.

Below the chart is the timestamp data delay, which is the profile collection time minus the maximum timestamp value in the profiled data.

Sample Data

Five rows of sample data are displayed to help you understand the table's contents. Data for all columns is included— you might need to scroll right to review them all.

Profiled columns

Profiles for the table's columns are displayed below the Sample Data, in the order they appear in the table. Different data types produce different profiles, with statistics relevant to the data type. In addition, if Lightup detects a pattern in a string column, that pattern is noted in the column's profile.

To select specific profiled columns to display, click Select column name and then choose columns from the list that appears.

Column profile information

All column profiles display percentages of Valid and Invalid column values. Valid values are further broken down into Unique and Nonunique percentages. The only Invalid values are nulls (absent data), so Null is equal to Invalid. The sum of Valid and Invalid is always 100%.

Numeric column profiles display the Min, Lower quartile (Q1), Median, Mean, Upper quartile (Q3), and Max values (standard statistics that describe the distribution of values), a graph , as well as the percentages of column values that are Zero %, Positive %, Negative %, and Nan% (not a number). Below these statistics is a value-distribution chart for the column.

String column profiles display the Max string length, Min string length, and Mean string length, as well as the Zero Length Count (the count of column values having zero length strings), any Patterns Detected, and the # of Distinct Values (i.e., COUNT DISTINCT). Although there is no limit on the number of distinct values in a column profile, if there are fewer than 50 distinct values a chart displays the distribution of the top 20 values.

Timestamp column profiles display the Mean Interval (the average time elapsed between two consecutive timestamps), Min Interval (the smallest amount of time elapsed between two consecutive timestamps), and Max Interval (the largest amount of time elapsed between two consecutive timestamps), the Min Timestamp, the Max Timestamp, and the Future Date % (the percent of values where the date is in the future). Below these statistics is a chart of recent timestamps— you can hover on the chart to see details.