Publish data to Azure ADLS Gen2 from Delta Live Tables pipelines?
12-06-2022 11:23 AM
I would like to publish data from a Delta Live Tables (DLT) pipeline to Azure ADLS Gen2 storage. I checked the documentation but cannot get it to work. Any insights on how to do so?
I've tried specifying the "Storage location" with many combinations of abfs://root@storageaccountname.dfs.core.windows.net/dev/delta_live_tables/ and also abfss://root@storageaccountname.dfs.core.windows.net/dev/delta_live_tables/ without any success.
So far I have only succeeded in writing to the hive_metastore and DBFS. But I want to write to an external location (ADLS Gen2) so the tables can be referenced by Unity Catalog.
12-06-2022 11:38 AM
If you are using the abfs protocol, you need some additional configuration to read files over it. So first mount that location in the Databricks workspace, then you can use the abfs path, I guess.
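For reference, a minimal sketch of the mount approach suggested here, using OAuth with a service principal (all names, scopes, and paths below are placeholders, not taken from this thread); note that the reply below recommends configuring credentials in the pipeline settings instead of mounting:
# Hypothetical service-principal credentials, pulled from a secret scope
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<secret-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}

# Mount the container so it can be referenced via /mnt/... instead of abfss://
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)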
12-06-2022 11:45 AM
I can write to and read from ADLS Gen2 using both the abfs and abfss protocols from regular notebooks. But for some reason it is not working for DLT.
Do you know if there is a need to mount ADLS Gen2 to make it work specifically with DLT?
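A minimal sketch of the kind of direct notebook access being described (paths are placeholders; it assumes the cluster or session already has ADLS credentials configured, e.g. via the OAuth settings discussed below):
# Direct reads and writes with abfss paths from an interactive notebook
df = spark.read.format("delta").load("abfss://<container>@<storage-account>.dfs.core.windows.net/some/input/")
df.write.format("delta").mode("overwrite").save("abfss://<container>@<storage-account>.dfs.core.windows.net/some/output/")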
12-06-2022 11:57 AM
Mounting an ADLS container is no longer recommended. If you want DLT to materialize your data in ADLS, you need to do two things:
- In the DLT pipeline settings, configure ADLS credentials using either a SAS token or a Service Principal. See here for more details. If you are using a SAS token, we also recommend using Secrets to store the key. See here for a concrete example, and see the sketch after this list.
- Specify the ADLS location as your DLT pipeline storage path, e.g. abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/my_folder/
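A minimal sketch of the SAS-token option expressed as pipeline configuration key/value pairs (key names are the standard Hadoop ABFS driver settings; the scope and secret names are placeholders, and as noted later in the thread the keys may need a spark.hadoop. prefix):
fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net SAS
fs.azure.sas.token.provider.type.<storage-account-name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider
fs.azure.sas.fixed.token.<storage-account-name>.dfs.core.windows.net {{secrets/<scope-name>/<sas-token-secret-name>}}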
12-07-2022 12:59 AM
In the DLT settings you need to go to Compute -> Advanced and add the configuration (of course, each key needs the value required in your case):
fs.azure.account.auth.type OAuth
fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id {{secrets/scope/databricks-clientid}}
fs.azure.account.oauth2.client.secret {{secrets/scope/databricks-secret}}
fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<YOUR_DIRECTORY_ID>/oauth2/token
If you have problems, you can prefix all the keys with spark_conf.spark.hadoop.
12-09-2022 06:45 AM
Great, will try this. Thanks a lot!
12-20-2022 02:50 AM
Finally got it to work and here's what we did:
Pre-conditions:
- Service Principal created
- Secret created in Azure KeyVault
- Scope & Secret created in Databricks (append "#secrets/createScope" to the workspace URL to get the Databricks page where you create the secret scope; a quick check is sketched right after this list)
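As a quick sanity check that the scope and secret are readable from the workspace (a minimal sketch; the scope and key names are the same placeholders used in the JSON further down):
# List the secret scopes visible to the workspace
dbutils.secrets.listScopes()

# List the keys inside the scope
dbutils.secrets.list("<scope-name-in-databricks>")

# Fetch the secret; the value is redacted when displayed, but an error here means
# the {{secrets/...}} reference in the pipeline settings will fail too
dbutils.secrets.get(scope="<scope-name-in-databricks>", key="<secret-name-in-azure-keyvault>")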
Note:
- Only one DLT pipeline per external storage location
- The keys related to ADLS access are prefixed with "spark.hadoop."
- You won't get lineage in Unity Catalog when using DLT; however, you can see the lineage in DLT instead.
DLT Pipeline settings - JSON (example of what worked; substitute your own parameter values)
{
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    },
    {
      "label": "maintenance"
    }
  ],
  "development": true,
  "continuous": false,
  "channel": "PREVIEW",
  "edition": "ADVANCED",
  "photon": false,
  "libraries": [
    {
      "notebook": {
        "path": "/Repos/repo_folder/delta-live-tables-notebooks/sql/Retail Sales"
      }
    }
  ],
  "name": "Retail Sales SQL External Tables",
  "storage": "abfss://<container>@<azure-storage-account>.dfs.core.windows.net/delta_live_tables/",
  "configuration": {
    "spark.hadoop.fs.azure.account.auth.type": "OAuth",
    "spark.hadoop.fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "spark.hadoop.fs.azure.account.oauth2.client.id": "<application-id>",
    "spark.hadoop.fs.azure.account.oauth2.client.secret": "{{secrets/<scope-name-in-databricks>/<secret-name-in-azure-keyvault>}}",
    "spark.hadoop.fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"
  }
}
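The "Retail Sales" notebook referenced above is not reproduced here; as a minimal hypothetical stand-in (the real one is SQL, and the table name and source path below are made up), a DLT notebook that materializes a table under the storage path could look like this in Python:
import dlt

@dlt.table(comment="Hypothetical example table materialized under the pipeline storage path")
def customers():
    # Any source works; this just reads a made-up CSV folder from the same storage account
    return (
        spark.read.format("csv")
        .option("header", "true")
        .load("abfss://<container>@<azure-storage-account>.dfs.core.windows.net/raw/customers/")
    )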
Unity Catalog - Register a new External Table
CREATE TABLE IF NOT EXISTS customers
LOCATION 'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<folder>/tables/customers/'
Summary - How to get DLT & Unity Catalog to Work Together
This way you can get Delta Live Tables (DLT) to work together with Unity Catalog by referring to external tables.
02-25-2023 06:37 PM
Thanks for the write-up, it worked for me. One question: in my case, I let DLT write its tables to an abfss://data@... storage path using your config pattern above. It successfully writes the tables, but when I view the tables in hive_metastore in Data Explorer, I get the fs.configuration error. If I manually run the OAuth routine to connect to that account, it connects, but I still can't view the data in Data Explorer. However, I can create an external table in Unity Catalog just fine (as it's using the Unity storage credentials).
05-01-2023 03:07 PM
I am trying to load from ADLS with DLT. After adding the Spark configurations I got the error below:
org.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = 818323fc-80d5-4833-9f46-7d1afc9c5bf7, runId = 722e9aac-0fdd-4206-9d49-683bb151f0bf] terminated with exception: The container in the file event `{"backfill":{"bucket":"root@dbstoragelhdp7mflfxe2y","key":"5810201264315799/Data/Temp/test_account.csv","size":1801,"eventTime":1682522202000,"newerThan$default$2":false}}` is different from expected by the source: `my_container@my_storageaccount`.
Any suggestions?
Thanks,

