Imagine a world where building complex data pipelines is as simple as describing what you want rather than meticulously coding how to do it. This is one of the key propositions that Delta Live Tables (DLT) brings to the table. As a declarative framework, DLT transforms how we approach ETL (Extract, Transform, Load) processes: you specify what needs to be done, not how to do it. It also addresses the common headaches of data engineering (complex code, manual optimizations, dependency management, unifying batch and streaming, auto-scaling, infrastructure management) by offering a simpler yet efficient way to create reliable data pipelines. While DLT optimizes and simplifies many aspects of pipeline development, there are still strategic techniques and best practices you can employ to make your pipelines even more efficient. In this blog, we'll dive into the top tips and tricks to help you get the most out of DLT, ensuring your data workflows are as robust and performant as possible.
If you've been using Delta Live Tables (DLT) to build robust data pipelines, chances are you’re already familiar with some of the core best practices.
With the basics out of the way, let’s look at the next set of top 5 tips to build DLT pipelines optimally.
If you already use DLT serverless, you are a rockstar and can skip this tip. If you are still using DLT classic, which runs on a cluster of VMs, you can use the pipeline compute settings to further optimize your pipelines.
One of the biggest advantages serverless provides is reduced start-up time. Instead of waiting minutes for VMs to be launched in your cloud account and become ready to run your DLT pipeline, serverless brings start-up down to seconds. For latency-sensitive workloads, this is quite important. However, if you are using DLT classic, you can still reduce cluster start-up time significantly by using cluster pools.
Databricks cluster pools optimize cluster start-up times by maintaining a cache of pre-warmed virtual machine instances, allowing clusters to acquire resources quickly without waiting for new instances from the cloud provider. This significantly reduces start-up time and is particularly beneficial for short, automated jobs. Pools manage the lifecycle of these instances efficiently and offer cost savings: idle instances incur only cloud provider costs, with no additional Databricks Unit (DBU) charges. By enabling faster start-up and dynamic scaling, pools enhance the performance and efficiency of data processing tasks in data pipelines. One call-out is that tags are managed slightly differently in pools; see the documentation for more details.
If you have already created a pool, you can specify it in the DLT pipeline settings. Although this option is not available in the UI, you can simply toggle from the UI to the JSON view of the pipeline settings and specify it there. For instance, here's the JSON view of a DLT pipeline that uses a cluster pool for its compute, as signified by the instance_pool_id key:
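A sketch of the relevant portion of the pipeline settings JSON; the pool ID is a placeholder, and the same pool is reused here for both the default and maintenance clusters:

```json
{
  "clusters": [
    {
      "label": "default",
      "instance_pool_id": "<your-pool-id>",
      "driver_instance_pool_id": "<your-pool-id>",
      "num_workers": 4
    },
    {
      "label": "maintenance",
      "instance_pool_id": "<your-pool-id>"
    }
  ]
}
```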
One key call-out here: while creating pools, set the "Preloaded Databricks Runtime Version" setting to None. Although preloading a DBR normally helps speed up cluster launches (the runtime is already present on idle instances in the pool), DLT uses its own custom DBR, so this option shouldn't be selected.
If you are running a lot of flows (i.e., processing many streaming tables and materialized views) in your DLT pipeline, the driver node can come under strain. As DLT is built on the battle-tested foundations of Structured Streaming, the driver node plays a crucial role in streaming workloads: it schedules micro-batches, tracks source offsets and checkpoint progress, and coordinates all of the concurrently running streaming queries.
If you observe in the cluster metrics section that CPU and memory utilization on the driver node is being exhausted, it could be due to the reasons mentioned above. In such circumstances, it is recommended to use a bigger driver node. As mentioned in the documentation, you can specify compute settings for the pipeline's clusters; specifically, you can set all but a few of the cluster attributes. A good reference is the create cluster REST API endpoint, which lists the cluster attributes one can use. With this, you can specify a bigger driver node as required in the clusters part of your pipeline settings JSON:
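A sketch of such a configuration; the instance types are illustrative (AWS names shown) and should be swapped for whatever your cloud and workload require:

```json
{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "m5.xlarge",
      "driver_node_type_id": "m5.4xlarge",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    }
  ]
}
```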
The same concept, i.e., using the compute settings of the DLT pipeline, can be used to achieve the following benefits:
If you use production mode in DLT and want to reuse the cluster for subsequent updates, that's not possible by default: production mode launches a new cluster for every update. This recycling ensures that a freshly provisioned cluster runs each update, but it adds an overall delay, since every update pays the cluster start-up cost (which cluster pools can reduce to a great extent). Thus, if you want to reuse the existing cluster in production mode, you can set a higher value for the cluster shutdown delay in the pipeline configuration, e.g., 120 minutes:
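A sketch of the relevant pipeline settings, assuming the `pipelines.clusterShutdown.delay` configuration key (check the current documentation for the exact key name and accepted duration format):

```json
{
  "configuration": {
    "pipelines.clusterShutdown.delay": "120m"
  }
}
```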
By default, in production mode the delay is 0 seconds, so a new cluster is launched for every update (in development mode, the default is already 2 hours).
Writing or persisting intermediate data, especially when it is not necessary, is generally discouraged when building ETL pipelines. It can significantly increase storage costs as more space is consumed over time, and it can introduce latency. Additionally, each write operation consumes compute resources, potentially affecting other operations and overall system performance.
In DLT, you primarily work with three types of objects with respect to persistence: streaming tables and materialized views, both of which are persisted to storage, and views, which are computed on demand and never persisted.
Thus, if you need to apply transformation logic on tables used downstream, and you don't need the resulting dataset to be published to UC, it is recommended to use views instead of the other two options, for the reasons mentioned previously.
Here is an example scenario. Suppose you are using Auto Loader to incrementally process raw files and create bronze tables with minimal to zero transformation, thus closely resembling the source. However, you want to apply certain transformations (e.g., timestamp conversions, type casting) before merging into a target table with SCD Type 1 or 2 using the APPLY CHANGES INTO API. In this case, you can define a view on the bronze table containing the necessary logic, and then use that view as the source for APPLY CHANGES INTO. This prevents persisting an intermediate dataset that exists primarily for the merge operation.
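A minimal Python sketch of this pattern, runnable only inside a DLT pipeline (where `spark` and the `dlt` module are provided); the table names, columns, and storage path are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

# Bronze table: raw files ingested with Auto Loader, minimal transformation.
@dlt.table
def customers_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/default/raw/customers/")  # hypothetical path
    )

# Intermediate transformations live in a view, so nothing extra is persisted.
@dlt.view
def customers_bronze_clean():
    return (
        dlt.read_stream("customers_bronze")
        .withColumn("updated_at", F.to_timestamp("updated_at"))
        .withColumn("age", F.col("age").cast("int"))
    )

dlt.create_streaming_table("customers_silver")

# SCD Type 1 merge sourced from the view, not a persisted table.
dlt.apply_changes(
    target="customers_silver",
    source="customers_bronze_clean",
    keys=["customer_id"],
    sequence_by=F.col("updated_at"),
    stored_as_scd_type=1,
)
```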
The Databricks platform is constantly undergoing significant simplification, largely driven by AI optimizations powered by DatabricksIQ, which aim to eliminate configuration complexities and reduce maintenance overhead. Delta Live Tables also embodies this approach by abstracting many configurations, allowing data engineers to focus on delivering value. However, there are still scenarios where manual performance tuning is necessary to achieve optimal results.
While defining a DLT table, it is possible to specify a number of properties that, if configured properly, can positively improve DLT pipelines’ performance. For instance:
Additionally, the following Delta properties have been observed to be commonly used as well:
Managing table properties can become tedious when you need to set them individually for every table. Often, users want a way to apply default properties across all tables at the pipeline level, avoiding repetitive code. Unfortunately, setting global table properties directly through Spark configurations doesn’t work with DLT. To solve this, we can create a custom wrapper around the dlt.table decorator that automatically injects a set of default table properties while still allowing flexibility for specific tables to override them. This approach streamlines the pipeline configuration, making it easier to manage and maintain. Below is an example of how to implement this solution effectively in Python:
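One way to sketch this wrapper in Python; the default property names and values below are illustrative, and the `dlt` import is deferred into the wrapper so it resolves only inside a pipeline:

```python
# Pipeline-wide defaults; the property names/values below are illustrative.
DEFAULT_TABLE_PROPERTIES = {
    "delta.autoOptimize.optimizeWrite": "true",
    "delta.autoOptimize.autoCompact": "true",
}

def merge_table_properties(overrides=None):
    """Combine the pipeline-wide defaults with per-table overrides.

    Per-table values win on conflict; the defaults dict is never mutated.
    """
    props = dict(DEFAULT_TABLE_PROPERTIES)
    props.update(overrides or {})
    return props

def table_with_defaults(name=None, table_properties=None, **kwargs):
    """Wrapper around @dlt.table that injects the default table properties."""
    import dlt  # only importable inside a DLT pipeline

    def decorator(func):
        return dlt.table(
            name=name or func.__name__,
            table_properties=merge_table_properties(table_properties),
            **kwargs,
        )(func)

    return decorator

# Usage inside a pipeline, overriding one default for a specific table:
#
# @table_with_defaults(table_properties={"delta.autoOptimize.autoCompact": "false"})
# def orders_silver():
#     return dlt.read_stream("orders_bronze")
```

Because the merging logic lives in a plain function, each table keeps full control: anything passed in `table_properties` overrides the pipeline-wide default for that table only.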
A flow is a streaming query designed to process source data incrementally, updating a target streaming table without needing to reprocess the entire dataset. This incremental approach can be highly efficient, especially when working with large datasets or continuous streams of data. Flows can also operate implicitly: when you create a query that directly updates a streaming table, a flow is often generated behind the scenes without needing explicit configuration.
Understanding the mechanics of flows is crucial for optimizing your data pipelines. Append flows allow data to be added or merged incrementally rather than requiring a full-table refresh every time new data becomes available. This is particularly useful for real-time or near-real-time applications where continuous updates to data are required. By leveraging append flows, you can significantly reduce processing overhead and cost, while also improving the responsiveness of your pipeline.
Scenario 1: When managing data streams, business needs or data requirements often evolve. For example, you may start with a few streaming data sources but find the need to integrate additional streams over time. Instead of creating a brand new streaming table and reprocessing all the data (which can be computationally expensive and time-consuming), you can use an append flow to simply add the new streaming sources to the existing table. The append flow integrates these new sources incrementally, without having to perform a full refresh. This is particularly helpful for large or continuously growing datasets, as it ensures minimal disruption while keeping your pipeline adaptable to changes in data sources.
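A sketch of this scenario using the `dlt.append_flow` decorator, runnable only inside a DLT pipeline; the table and source names are hypothetical:

```python
import dlt

# Target streaming table, already fed by an existing source.
dlt.create_streaming_table("all_events")

@dlt.append_flow(target="all_events")
def events_from_east():
    return spark.readStream.table("events_east_raw")

# A new source appears later: add another append flow instead of
# rebuilding the table and reprocessing its history.
@dlt.append_flow(target="all_events")
def events_from_west():
    return spark.readStream.table("events_west_raw")
```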
Scenario 2: In traditional pipelines, combining multiple datasets often requires performing a UNION operation, which consolidates rows from different queries into a single result set. While effective, UNION operations can be computationally expensive, especially if you're working with large datasets or performing frequent updates. Instead of relying on costly UNIONs, append flows allow you to achieve the same result by incrementally merging new data into the target table as it becomes available. This reduces the need for full-refresh updates, meaning that only the new data is processed and appended. The result is a much more efficient pipeline that saves on both time and computational resources, while still maintaining up-to-date results in the target table.
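A sketch of the UNION replacement, again runnable only inside a DLT pipeline; the region names and bronze tables are hypothetical, and each region gets its own named flow into the one target:

```python
import dlt

dlt.create_streaming_table("combined_orders")

# One append flow per source; each appends incrementally, so no UNION
# over the full datasets is ever computed.
for region in ["us", "eu", "apac"]:
    @dlt.append_flow(target="combined_orders", name=f"orders_{region}_flow")
    def orders_flow(region=region):
        return spark.readStream.table(f"orders_{region}_bronze")
```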
In both scenarios, append flows improve the scalability and efficiency of data pipelines by minimizing the need for full-table updates. This can lead to significant cost savings, especially when dealing with cloud resources or large-scale, real-time data processing systems.
Delta Live Tables allows us to use Auto Loader for most data ingestion tasks from cloud object storage. Auto Loader and Delta Live Tables work together to incrementally and idempotently ingest and process data as it arrives.
To keep track of new files, Auto Loader can monitor the input directory for new file events in two different ways. By default, it uses directory listing mode; the alternative is file notification mode.
Directory Listing Mode operates by periodically scanning the input directory to detect new files for ingestion. It relies solely on listing files without the need for additional cloud services or event-based notifications, which makes it easy to set up and configure. One of the main benefits of directory listing is the ability to control ingestion frequency. You can schedule scans based on your data processing needs, which is useful if you want to limit the volume of data being ingested at once. Additionally, directory listing is efficient when dealing with large files, as fewer files mean less frequent scans and lower processing overhead. However, there are some drawbacks to this method, such as latency—since it scans at intervals, there can be delays between file arrival and ingestion. This method also lacks scalability, as large numbers of files can lead to longer scanning times and higher compute costs. Directory listing is best suited for scenarios involving large files that don’t require real-time processing or when you need to control ingestion timing. Also note that your cloud provider may charge you for LIST operations on your cloud storage, making this an expensive option for low latency, high volume use cases.
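A sketch of directory listing ingestion in a DLT pipeline (runnable only inside a pipeline; the path and table name are hypothetical). Directory listing is the default, so the option is shown explicitly only for clarity:

```python
import dlt

@dlt.table
def raw_invoices():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        # Directory listing is the default mode; stated here for clarity.
        .option("cloudFiles.useNotifications", "false")
        .load("/Volumes/main/default/landing/invoices/")  # hypothetical path
    )
```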
File Notification Mode, on the other hand, offers a more scalable and performant solution for real-time, high-volume data ingestion. Instead of scanning the directory, this mode uses event-driven notifications from cloud storage to detect new files as soon as they arrive, making it ideal for real-time ingestion where low latency is essential. This method is especially efficient for large volumes of files, as it avoids the overhead of scanning directories and instead relies on notifications for immediate processing. File notification integrates with cloud services like AWS SQS, Azure Event Grid, and Google Pub/Sub, but it requires a more complex setup and monitoring of resource limits, such as the number of queues or notifications allowed by your cloud provider.
Note that Auto Loader automatically sets up the notification and queue service that subscribes to file events on single user compute only. For use with DLT, manual setup of these cloud resources is needed. You can then specify the configuration as part of your pipeline:
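A sketch of file notification ingestion, runnable only inside a DLT pipeline; the bucket, queue URL, and table name are hypothetical placeholders for the resources you created during manual setup (AWS SQS shown here):

```python
import dlt

@dlt.table
def raw_invoices():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.useNotifications", "true")
        # Point at the queue created during manual setup (hypothetical URL).
        .option("cloudFiles.queueUrl",
                "https://sqs.us-east-1.amazonaws.com/123456789012/autoloader-queue")
        .load("s3://my-bucket/landing/invoices/")  # hypothetical path
    )
```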
Refer to the summary table below to determine which mode is better suited for your use case:
| | Directory Listing | File Notification Mode |
| --- | --- | --- |
| Ingestion Type | Periodic, scheduled ingestion | Event-driven, real-time ingestion |
| Setup Complexity | Simple, no additional services required | Complex, requires configuring cloud-based notifications |
| Latency | Higher latency due to periodic scanning | Low latency, immediate file detection |
| Efficiency with Small Files | Less efficient, scans entire directory | Highly efficient, processes only new files |
| Efficiency with Large Files | More efficient, suited for larger files | Can handle large files, but more effective with small files |
| Scalability | Limited scalability as the number of files increases | High scalability for large directories and frequent file events |
Delta Live Tables (DLT) offers a simplified approach to building efficient data pipelines through a declarative framework, minimizing the need for complex coding and manual optimizations. Key tips for optimizing DLT pipelines include using Databricks' Photon engine for enhanced processing, leveraging serverless architecture for reduced start-up times, and taking advantage of auto-scaling features. Other strategies involve optimizing pipeline compute settings by utilizing cluster pools and bigger driver nodes, avoiding unnecessary data persistence, and fine-tuning performance with specific table properties. Additionally, using flows and append operations for efficient data streaming and choosing the right file detection mode, whether directory listing or file notification, can significantly improve the scalability and cost-efficiency of data pipelines. These best practices ensure robust, high-performing ETL processes with DLT.