Partitions

In the world of database administration, every ounce of available compute resources is precious. Queries that run too long or consume too much data can slow down the system significantly, or even lock it up altogether. There are multiple optimization techniques to solve for this problem, whether that is indexing, leveraging read-replicas, partitioning the data, or a combination of all of the above.

For many, the first step is setting up indexes on the data. This technique isn't anything particularly difficult to understand, since a data index operates the same as a book's index does. The database will organize the dataset in a certain order, track what that order is, and leverage that known order for future queries. Once an index is complete, the next step is partitioning the data.

A Partition is also easy to understand, since it behaves exactly as it sounds: You set up partitions in the dataset. Similar to how an index works (organize data in an order and keep track of that order), partitions separate out the data into groups, track where those groups are, and then only leverage the required groups for the query request.

Let's look at a quick example:
The dataset in question is a loyalty system dataset which tracks the first name, last name, loyalty number, purchase location, order amount, ordered items, and the transaction time.

As transactions happen, they are loaded into your database in the order they happened (generally speaking). If you wanted to look for a specific transaction from that last 10 days, your query might look like:

SELECT * FROM  
  transaction_table  
WHERE  
  transaction_id = 0000000000019847

But querying a couple million rows for just that one transaction requires the database to go through every single row until it finds the transaction, which could take a long time. IF you index the entire dataset on transaction_id, then the database would order the dataset in sequential order according to the transaction_id. Now when you query the data, the database knows where the transaction_id should generally be, making the query much faster. But, what if your data size continues to grow AND you want to do some analytics on the data? For example, you might want to get all of the revenue for the last day based on the store. That query would look something like:

SELECT  
  SUM(order_amount) as revenue,
  *
  FROM  
    transaction_table  
  WHERE  
    transaction_date LIKE '2023-08-23'

Even though you might have an index, all of the data in the whole dataset still has to be queried in order to get the SUM of the revenue. This is where partitions can be extremely helpful. If you setup an index on the transaction_date column in the dataset and also partition the dataset on transaction_date, then not only will the database order the transactions by the date, but now each day's worth of data is segmented. When the query is executed, the database can skip over all partitions that are not needed, load only the data in the appropriate partitions, and then finish out the analytics task. This massively speeds up how queries execute.

Lightup was built from the ground up to leverage these different optimization techniques so that all of the push-down queries are highly efficient. When the user configures a table or metric, they have the option of specifying the partition structure for Lightup to use. One thing to note is that, since Lightup is mainly executing time-range bound queries, the partitioning scheme that works is when the partition is set to a date/time oriented column.

Add partitions

  1. To add a partition column, click Add + and then select the column from the drop-down list. Partition columns are labelled (partition) in the list.
  2. Specify the partition's format using Python format codes. When Lightup generates a query for a time interval, it will convert the start and end of that time interval to the partition date format and generate a where clause that queries only the partitions where the data is expected to reside. Consider the following examples to better understand how partition formats work.
    • Example 1: If you have a partition column called month and it contains a two-digit month number, you can enter %m for the format. If Lightup is querying the date range 03/07/2023 through 03/13/2023, the following WHERE clause will be included in the query:
      WHERE `month` in (03)
      
    • Example 2: If you have three partition columns called year (a four-digit year), month (a two-digit month), and day (a two-digit day), you can enter %Y for year, %m for month, and %d for day. If Lightup is querying the date range 03/07/2023 through 03/13/2023, the following WHERE clause will be included in the query:
      WHERE
      			`year` in (2023)
      			AND `month` in (03)
      			AND `day` in (07, 08, 09, 10, 11, 12, 13)
      
    • Example 3: If you have a partition column called datehour that contains the date and hour, in a format such as 03/07/2023/18, you can enter %y/%m/%d/%H for the format. If Lightup is querying the time range 03/07/2023 18:00 through 03/08/2023 18:00, the following WHERE clause will be included in the query:
      WHERE
      		CAST(`datehour` AS STRING) in (
      				'2023/03/07/18',
      				'2023/03/07/19'
              ...
              '2023/03/07/23',
              '2023/03/08/00',
              '2023/03/08/01',
              ...
              '2023/03/08/18'
      			)
      
  3. Click Advanced to configure the Timezone, Partition Offset Past and Partition Offset Future settings for partitions. Partition Offset Past and Partition Offset Future let you add partition offsets to tell Lightup to query adjacent partitions associated with timestamps in the past or future.
    • In Example 2 above (with year, month, and day partitions), if Partition Offset Past is set to 2 days, the WHERE clause will include two additional days in the past (05 and 06):
      WHERE
      			`year` in (2023)
      			AND `month` in (03)
      			AND `day` in (05, 06, 07, 08, 09, 10, 11, 12, 13)
      
    • In Example 3 above (with a partition column called datehour), if Partition Offset Past is set to 2 days in the past, the WHERE clause will include the full two-day period, hour by hour for the two days prior to the query period, resulting in:
      WHERE
      		CAST(`datehour` AS STRING) in (
              '2023/03/05/18',
              ...
              '2023/03/05/23',
              '2023/03/06/00',
              ...
              '2023/03/06/23',
              '2023/03/07/00',
              ...
      				'2023/03/07/18',
      				'2023/03/07/19'
              ...
              '2023/03/07/23',
              '2023/03/08/00',
              '2023/03/08/01',
              ...
              '2023/03/08/18'
      			)
      

Use partitions to optimize an SQL metric

If your table has a date-based partition column, you can optimize your query to query only the partitions where the data resides. To do this, you need to do two things:

  1. Add conditions to your WHERE clause which restrict the query to querying only the partitions where you expect your data to reside. This is done by adding the special tags {start_partition} and {end_partition} to your WHERE clause
  2. Specify the format of that will be generated for {start_partition} and {end_partition} by filling in the Partition format input.This enables Lightup to convert {start_partition} and {end_partition} into string representations of the start and end of the query period using the date format specified in the Partitions input

For example, if Partition is set to "%Y-%m-%d-%H" and the query time range is from 3/1/2023 15:00:00 to 3/4/2023 17:00:00, {start_partition} will be set to 2023-03-01-15 and {end_partition} will be set to 2023-03-04-17. Let's assume you have a column in your dataset that is in this format, called datehour. You could now restrict your query to look only in the appropriate partitions by adding the following to your WHERE clause:

AND `datehour` >= {start_partition}
AND `datehour` <= {end_partition}

This will resolve to

AND `datehour` >= '2023-03-01-15'
AND `datehour` <= '2023-03-04-17'

It's important to note that this is a string comparison. This means that the comparison is alphabetical and in order for it to match, the value of your column (datehour in this case) must be lexicographically ordered within the start and end range specified by {start_partition} and {end_partition}. If your datehour column has a format that is different from the format that you specify as the partition format, the results will not make sense. For example, if datehour has format 2023-Mar-01-15, it will clearly not be found between 2023-03-01-15 and 2023-03-04-17.

It's also important to note that in order to use >= and <= as shown in the example above, the partition format must be lexicographically sortable. For example, if your column has the format 2023-1-15 (single digit for month), you cannot specify this format as your partition format because a lexicographical sort would result in 2023-12-25 being found between 2023-1-15 and 2023-2-15, which is clearly not correct.