08-16-2022 01:29 AM
You’ve gotten familiar with Delta Live Tables (DLT) via the quickstart and getting started guide. Now it’s time to tackle creating a DLT data pipeline for your cloud storage, with one line of code. Here’s how it’ll look when you’re starting:
CREATE OR REFRESH STREAMING LIVE TABLE <table_name>
AS SELECT * FROM cloud_files('<cloud storage location>', '<format>')
The cloud storage location can be AWS S3 (s3://), Azure Data Lake Storage Gen2 (ADLS Gen2, abfss://), GCP Cloud Storage (GCS, gs://), Azure Blob Storage (wasbs://), or ADLS Gen1 (adl://). Databricks File System (DBFS, dbfs:/) is also an option, but it’s not recommended for production pipelines.
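If you want a quick sanity check on a path before putting it in the pipeline, the scheme list above can be written down in a few lines of Python. This is only a sketch: the helper name `check_storage_location` is hypothetical and not part of any Databricks API, and `s3a` is included only because the S3 example later in this post uses it.

```python
from urllib.parse import urlparse

# Schemes named in this post. "s3a" is included because the S3 example uses it.
# DBFS is accepted here, but the post does not recommend it for production.
SUPPORTED_SCHEMES = {
    "s3": "AWS S3",
    "s3a": "AWS S3",
    "abfss": "Azure Data Lake Storage Gen2",
    "gs": "GCP Cloud Storage",
    "wasbs": "Azure Blob Storage",
    "adl": "ADLS Gen1",
    "dbfs": "Databricks File System",
}


def check_storage_location(path: str) -> str:
    """Return the storage service for a path, or raise ValueError (hypothetical helper)."""
    scheme = urlparse(path).scheme.lower()
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"Unrecognized storage scheme: {scheme!r}")
    return SUPPORTED_SCHEMES[scheme]
```

It only inspects the URI scheme; it does not check that the location exists or that the cluster can read it.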
Check out these 5 tips to get DLT to run that one line of code.
1. Use Auto Loader to ingest files to DLT
2. Let DLT run your pipeline notebook
3. Use JSON cluster configurations to access your storage location
4. Specify a Target database for your table(s)
5. Use ‘Full refresh all’ to pull DLT pipeline code and settings changes
Tip #1: Use Auto Loader to ingest files to DLT
Knowledge check: What is Auto Loader?
Auto Loader provides a Structured Streaming source called cloud_files. Given an input directory path on the cloud file storage, the cloud_files source automatically processes new files as they arrive, with the option of also processing existing files in that directory. Auto Loader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT and BINARYFILE file formats. Auto Loader has support for both Python and SQL in Delta Live Tables.
Example: Auto Loader with S3
CREATE OR REFRESH STREAMING LIVE TABLE my_S3_data
AS SELECT * FROM cloud_files('s3a://your_database_name', 'json')
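If you create the same kind of table for several sources, a small Python helper can render the statement for you and catch a mistyped format early. This is a rough sketch, not a Databricks API: `render_cloud_files_statement` is a hypothetical name, the format list is simply the one given in the knowledge check above, and the values are inserted as-is with no quoting or escaping.

```python
# File formats this post lists as supported by Auto Loader.
AUTO_LOADER_FORMATS = {"json", "csv", "parquet", "avro", "orc", "text", "binaryfile"}


def render_cloud_files_statement(table_name: str, location: str, file_format: str) -> str:
    """Render the one-line DLT SQL statement (hypothetical helper, no escaping)."""
    fmt = file_format.lower()
    if fmt not in AUTO_LOADER_FORMATS:
        raise ValueError(f"Format not listed for Auto Loader: {file_format!r}")
    return (
        f"CREATE OR REFRESH STREAMING LIVE TABLE {table_name}\n"
        f"AS SELECT * FROM cloud_files('{location}', '{fmt}')"
    )
```

Paste the returned text into your pipeline notebook; the helper only builds a string and does not talk to Databricks.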
Your next steps
More resources
Tip #2: Let DLT run your pipeline notebook
Knowledge check: What is DLT?
Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. Read more in the Delta Live Tables introduction [AWS] [Azure] [GCP].
Example
Your next step
More resources
08-16-2022 01:36 AM
Tip #3: Use JSON cluster configurations to access your storage location
Knowledge check: How do I modify DLT settings using JSON?
Delta Live Tables settings are expressed as JSON and can be modified in the Delta Live Tables UI [AWS] [Azure] [GCP].
Example: Add an S3 instance profile to the DLT Cluster Config via JSON
"clusters": [
  {
    "label": "default",
    "aws_attributes": {
      "instance_profile_arn": "arn:aws:..."
    },
    "autoscale": {
      "min_workers": 1,
      "max_workers": 5
    }
  }
]
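If you would rather script this edit than type it into the UI, the same change can be sketched in Python with the standard `json` module. The function name `with_instance_profile` is hypothetical; the keys are the ones from the fragment above, and the ARN stays a placeholder that you replace with your own.

```python
import json


def with_instance_profile(settings: dict, instance_profile_arn: str) -> dict:
    """Return a copy of the settings with the ARN set on the 'default' cluster."""
    updated = json.loads(json.dumps(settings))  # deep copy via a JSON round trip
    clusters = updated.setdefault("clusters", [])
    for cluster in clusters:
        if cluster.get("label") == "default":
            break
    else:
        # No cluster labeled "default" yet, so add one.
        cluster = {"label": "default"}
        clusters.append(cluster)
    cluster.setdefault("aws_attributes", {})["instance_profile_arn"] = instance_profile_arn
    return updated


settings = {"clusters": [{"label": "default", "autoscale": {"min_workers": 1, "max_workers": 5}}]}
print(json.dumps(with_instance_profile(settings, "arn:aws:..."), indent=2))
```

The input dictionary is left untouched, so you can compare the before and after versions before pasting the result into the pipeline settings.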
Your next step
More resources
Tip #4: Specify a Target database for your table(s)
Knowledge check: Why set a Target?
Add the Target setting to configure a database name for your tables. Setting a Target makes using your new table(s) easier after you start the pipeline. If you don’t set a Target on pipeline creation in the UI, you can go back and set a Target in the JSON.
Examples
UI to set the target for a new pipeline
JSON to edit the target of an existing pipeline (See Tip #3)
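For the JSON route, here is a minimal Python sketch of the edit. It assumes the pipeline settings use a top-level `target` key for the database name; the function name `set_target` and the `name` value in the usage line are hypothetical.

```python
import json


def set_target(settings_json: str, database_name: str) -> str:
    """Return the settings JSON with the target database set (assumes a top-level 'target' key)."""
    settings = json.loads(settings_json)
    settings["target"] = database_name
    return json.dumps(settings, indent=2)


# Hypothetical settings with only a pipeline name filled in.
print(set_target('{"name": "my_pipeline"}', "my_database"))
```

Tables created by the pipeline can then be queried as `my_database.<table_name>`.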
Your next step
SELECT * FROM my_database.table_name
More resources
Tip #5: ‘Full refresh all’ pulls pipeline code and settings changes
Knowledge check: What are Pipeline Updates?
After you create the pipeline and are ready to run it, you start an update. An update does the following:
Example
More resources
So, how’s your DLT + cloud storage running? Drop your questions and tips in the thread! 🧵
10-13-2024 05:43 AM
Hi MadelynM,
How should we handle source file archival and data retention with DLT?
Source File Archival: Once the data from a source file has been loaded with DLT Auto Loader, we want to move the file from the source folder to an archive folder. How can we do that?
Data Retention: We have a requirement to keep only 90 days of data in the Lakehouse tables populated by DLT. How can we achieve that with DLT?
Thank you!