12-06-2022 11:23 AM
I would like to publish data from a Delta Live Tables (DLT) pipeline to Azure ADLS Gen2 storage. I checked the documentation but cannot get it to work. Any insights on how to do so?
I've tried to specify the "Storage location" with many combinations of abfs://root@storageaccountname.dfs.core.windows.net/dev/delta_live_tables/ and also abfss://root@storageaccountname.dfs.core.windows.net/dev/delta_live_tables/ without any success.
So far I have only succeeded in writing to the hive_metastore and DBFS. But I want to write to an external location (ADLS Gen2) so the tables can be referenced by Unity Catalog.
12-06-2022 11:38 AM
If you are using the abfs protocol, you need some additional configuration before you can read files with it. First mount that location in your Databricks workspace, then you can use the abfs protocol, I guess.
12-06-2022 11:45 AM
I can write to and read from ADLS Gen2 using both the abfs and abfss protocols from regular notebooks. But for some reason it is not working for DLT.
Do you know if there is a need to mount ADLS Gen2 to make it work specifically with DLT?
12-06-2022 11:57 AM
Mounting an ADLS container is no longer recommended. If you want DLT to materialize your data in ADLS, you need to do two things: set the pipeline's Storage location to the abfss:// path, and add the authentication configuration to the pipeline's cluster settings (see below).
12-07-2022 12:59 AM
In the DLT pipeline settings, go to Compute -> Advanced and add the following configuration (of course, each key needs the value required in your case):
fs.azure.account.auth.type OAuth
fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id {{secrets/scope/databricks-clientid}}
fs.azure.account.oauth2.client.secret {{secrets/scope/databricks-secret}}
fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<YOUR_TENANT_ID>/oauth2/token
If you have problems, you can prefix all keys with spark_conf.spark.hadoop.
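For reference, here is a rough sketch of how the same credentials can be sanity-checked from a regular (non-DLT) notebook before wiring them into the pipeline. It uses the account-scoped variants of the keys above; the storage account, container, secret scope and key names are placeholders, not values from this thread:

# Sanity-check the service principal from a regular notebook (placeholders throughout).
storage_account = "<storage-account-name>"
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope>", key="<client-id-key>"))
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope>", key="<client-secret-key>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<YOUR_TENANT_ID>/oauth2/token")

# If this listing succeeds, the same service principal should work for DLT once the
# keys are added (prefixed with spark.hadoop.) to the pipeline configuration.
display(dbutils.fs.ls(f"abfss://<container>@{storage_account}.dfs.core.windows.net/"))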
12-09-2022 06:45 AM
Great, will try this. Thanks a lot!
12-20-2022 02:50 AM
Finally got it to work and here's what we did:
Pre-conditions:
Note:
DLT Pipeline settings - JSON (example of what worked; substitute your own parameter values)
{
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    },
    {
      "label": "maintenance"
    }
  ],
  "development": true,
  "continuous": false,
  "channel": "PREVIEW",
  "edition": "ADVANCED",
  "photon": false,
  "libraries": [
    {
      "notebook": {
        "path": "/Repos/repo_folder/delta-live-tables-notebooks/sql/Retail Sales"
      }
    }
  ],
  "name": "Retail Sales SQL External Tables",
  "storage": "abfss://<container>@<azure-storage-account>.dfs.core.windows.net/delta_live_tables/",
  "configuration": {
    "spark.hadoop.fs.azure.account.auth.type": "OAuth",
    "spark.hadoop.fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "spark.hadoop.fs.azure.account.oauth2.client.id": "<application-id>",
    "spark.hadoop.fs.azure.account.oauth2.client.secret": "{{secrets/<scope-name-in-databricks>/<secret-name-in-azure-keyvault>}}",
    "spark.hadoop.fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"
  }
}
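As a side note, the same JSON can also be used to create the pipeline programmatically. A minimal sketch using the Delta Live Tables Pipelines REST API (the workspace URL, token and file name are placeholders, not values from this thread):

# Create the pipeline from the JSON spec above via the Pipelines REST API.
import json
import requests

host = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

with open("retail_sales_pipeline.json") as f:  # the pipeline settings shown above
    spec = json.load(f)

resp = requests.post(f"{host}/api/2.0/pipelines",
                     headers={"Authorization": f"Bearer {token}"},
                     json=spec)
resp.raise_for_status()
print(resp.json())  # contains the new pipeline_id

Pasting the same settings into the pipeline's JSON editor in the UI works just as well.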
Unity Catalog - Register a new External Table
CREATE TABLE IF NOT EXISTS customers
LOCATION 'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<folder>/tables/customers/'
Summary - How to get DLT & Unity Catalog to Work Together
This way you can get Delta Live Tables (DLT) to work together with Unity Catalog by registering the DLT output as external tables.
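For readers using Python rather than SQL notebooks, a minimal sketch of what a DLT source notebook reading from the same storage account could look like (the thread's example notebook is SQL; the path and format options below are placeholders):

import dlt

@dlt.table(comment="Raw customer records ingested from ADLS Gen2 with Auto Loader")
def customers():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("abfss://<container>@<storage-account>.dfs.core.windows.net/raw/customers/")
    )

With the pipeline's "storage" set to the abfss:// path, DLT materializes the table data under <storage>/tables/<table-name>, which matches the .../tables/customers/ location used in the CREATE TABLE statement above.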
02-25-2023 06:37 PM
Thanks for the write-up, it worked for me. One question: in my case I let DLT write its tables to an abfss://data@... storage path using your config pattern above. It successfully writes the tables, but when I view them under hive_metastore in Data Explorer I get the fs.configuration error. If I manually run the OAuth routine to connect to that account, it connects, but I still can't view the data in Data Explorer. However, I can create an external table in Unity Catalog just fine (as it uses the Unity Catalog storage credentials).
05-01-2023 03:07 PM
I am trying to load from ADLS with DLT. After adding the Spark configurations I got the error below:
org.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = 818323fc-80d5-4833-9f46-7d1afc9c5bf7, runId = 722e9aac-0fdd-4206-9d49-683bb151f0bf] terminated with exception: The container in the file event `{"backfill":{"bucket":"root@dbstoragelhdp7mflfxe2y","key":"5810201264315799/Data/Temp/test_account.csv","size":1801,"eventTime":1682522202000,"newerThan$default$2":false}}` is different from expected by the source: `my_container@my_storageaccount`.
Any suggestions?
Thanks,