Partitions

In the world of database administration, every ounce of available compute resources is precious. Queries that run too long or consume too much data can slow down the system significantly, or even lock it up altogether. There are multiple optimization techniques to solve for this problem, whether that is indexing, leveraging read-replicas, partitioning the data, or a combination of all of the above.

For many, the first step is setting up indexes on the data. This technique isn't anything particularly difficult to understand, since a data index operates the same as a book's index does. The database will organize the dataset in a certain order, track what that order is, and leverage that known order for future queries. Once an index is complete, the next step is partitioning the data.

A Partition is also easy to understand, since it behaves exactly as it sounds: You set up partitions in the dataset. Similar to how an index works (organize data in an order and keep track of that order), partitions separate out the data into groups, track where those groups are, and then only leverage the required groups for the query request.

Let's look at a quick example:
The dataset in question is a loyalty system dataset which tracks the first name, last name, loyalty number, purchase location, order amount, ordered items, and the transaction time.

As transactions happen, they are loaded into your database in the order they happened (generally speaking). If you wanted to look for a specific transaction from that last 10 days, your query might look like:

SELECT * FROM  
  transaction_table  
WHERE  
  transaction_id = 0000000000019847

But querying a couple million rows for just that one transaction requires the database to go through every single row until it finds the transaction, which could take a long time. IF you index the entire dataset on transaction_id, then the database would order the dataset in sequential order according to the transaction_id. Now when you query the data, the database knows where the transaction_id should generally be, making the query much faster. But, what if your data size continues to grow AND you want to do some analytics on the data? For example, you might want to get all of the revenue for the last day based on the store. That query would look something like:

SELECT  
  SUM(order_amount) as revenue,
  *
  FROM  
    transaction_table  
  WHERE  
    transaction_date LIKE '2023-08-23'

Even though you might have an index, all of the data in the whole dataset still has to be queried in order to get the SUM of the revenue. This is where partitions can be extremely helpful. If you setup an index on the transaction_date column in the dataset and also partition the dataset on transaction_date, then not only will the database order the transactions by the date, but now each day's worth of data is segmented. When the query is executed, the database can skip over all partitions that are not needed, load only the data in the appropriate partitions, and then finish out the analytics task. This massively speeds up how queries execute.

Lightup was built from the ground up to leverage these different optimization techniques so that all of the push-down queries are highly efficient. When the user configures a table or metric, they have the option of specifying the partition structure for Lightup to use. One thing to note is that, since Lightup is mainly executing time-range bound queries, the partitioning scheme that works is when the partition is set to a date/time oriented column.

Set up partitions for a table

You should set up suitable partitions when you configure a table. For steps, see Add a partition.