Publish data to Azure ADLS Gen2 from Delta Live Tables pipelines?

Alex006
Contributor

I would like to publish data from a Delta Live Tables (DLT) pipeline to Azure ADLS Gen2 storage. I have checked the documentation but cannot get it to work. Any insights on how to do so?

The pipeline fails with the error "Invalid configuration value detected for fs.azure.account.key". I've tried to specify the "Storage location" with many combinations of abfs://root@storageaccountname.dfs.core.windows.net/dev/delta_live_tables/ and also abfss://root@storageaccountname.dfs.core.windows.net/dev/delta_live_tables/, without any success.

So far I have only succeeded in writing to the hive_metastore and DBFS. But I want to write to an external location (ADLS Gen2) so the tables can be referenced by Unity Catalog.

8 REPLIES

Rishabh-Pandey
Esteemed Contributor

If you are using the abfs protocol, you need some additional configuration before you can read files over it. First mount that location in your Databricks workspace, then you should be able to use the abfs path, I guess.

Rishabh Pandey

I can read and write to ADLS Gen2 using both the abfs and abfss protocols from regular notebooks. But for some reason it is not working for DLT.

Do you know if there is a need to mount ADLS Gen2 to make it work specifically with DLT?

User16752244991
New Contributor III

Mounting an ADLS container is no longer recommended. If you want DLT to materialize your data in ADLS, you need to do two things:

  1. In DLT Pipeline settings, configure ADLS credentials using either a SAS token or a Service Principal. See here for more details. If you are using a SAS token, we also recommend using Secrets to store the key; see here for a concrete example. (A configuration sketch for the SAS case follows after this list.)
  2. Specify the ADLS location as your DLT pipeline storage path, e.g. abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/my_folder/
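
For illustration, if you go the SAS-token route, the pipeline's "configuration" block could look roughly like the sketch below. This is only an assumption based on the standard ABFS fixed-SAS-token settings, not something confirmed in this thread, and every placeholder name (scope, secret, storage account) is hypothetical:

"configuration": {
    "spark.hadoop.fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net": "SAS",
    "spark.hadoop.fs.azure.sas.token.provider.type.<storage-account-name>.dfs.core.windows.net": "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
    "spark.hadoop.fs.azure.sas.fixed.token.<storage-account-name>.dfs.core.windows.net": "{{secrets/<scope-name>/<sas-token-secret-name>}}"
}

The {{secrets/...}} reference keeps the SAS token itself out of the pipeline definition, which is the point of storing the key in a secret.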

Hubert-Dudek
Esteemed Contributor III

In the DLT pipeline settings you need to go to Compute -> Advanced and add the configuration below (of course, each key needs the value required in your case):

fs.azure.account.auth.type OAuth

fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider

fs.azure.account.oauth2.client.id {{secrets/scope/databricks-clientid}}

fs.azure.account.oauth2.client.secret {{secrets/scope/databricks-secret}}

fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-id>/oauth2/token

If you have problems, you can prefix all keys with spark_conf.spark.hadoop.
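
As a sketch only (reusing the placeholder scope and secret names from the list above), a prefixed entry in the pipeline's JSON "configuration" block would then look like:

"configuration": {
    "spark_conf.spark.hadoop.fs.azure.account.oauth2.client.id": "{{secrets/scope/databricks-clientid}}"
}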

Alex006
Contributor

Great, will try this. Thanks a lot!

Alex006
Contributor

Finally got it to work and here's what we did:

Pre-conditions:

  • Service Principal created
  • Secret created in Azure KeyVault
  • Scope & Secret created in Databricks (Append "#secrets/createScope" in the url to get the databricks page where you create secrets)

Note:

  • Only one DLT pipeline per external storage location
  • The keys, related to ADLS access, are prefixed with "spark.hadoop."
  • You won't get lineage in Unity Catalog when using DLT; however, you can see the lineage in DLT instead.

DLT Pipeline settings - JSON (example of what worked; substitute your own parameter values)

{
    "clusters": [
        {
            "label": "default",
            "autoscale": {
                "min_workers": 1,
                "max_workers": 5,
                "mode": "ENHANCED"
            }
        },
        {
            "label": "maintenance"
        }
    ],
    "development": true,
    "continuous": false,
    "channel": "PREVIEW",
    "edition": "ADVANCED",
    "photon": false,
    "libraries": [
        {
            "notebook": {
                "path": "/Repos/repo_folder/delta-live-tables-notebooks/sql/Retail Sales"
            }
        }
    ],
    "name": "Retail Sales SQL External Tables",
    "storage": "abfss://<container>@<azure-storage-account>.dfs.core.windows.net/delta_live_tables/",
    "configuration": {
        "spark.hadoop.fs.azure.account.auth.type": "OAuth",
        "spark.hadoop.fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "spark.hadoop.fs.azure.account.oauth2.client.id": "<application-id>",
        "spark.hadoop.fs.azure.account.oauth2.client.secret": "{{secrets/<scope-name-in-databricks>/<secret-name-in-azure-keyvault>}}",
        "spark.hadoop.fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"
    }
}

Unity Catalog - Register a new External Table

CREATE TABLE IF NOT EXISTS customers
LOCATION 'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<folder>/tables/customers/'
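
If you want to double-check the registration, one simple option (assuming the table name from the example above) is:

-- Verify that the external table points at the DLT output path:
DESCRIBE EXTENDED customers;

The Location row in the output should show the abfss:// path under the DLT storage location.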

Summary - How to get DLT & Unity Catalog to Work Together

This way you can get Delta Live Tables (DLT) and Unity Catalog to work together by registering the DLT output as external tables.

evogelpohl
New Contributor III

Thanks for the write-up, it worked for me. One question: in my case, I let DLT write its tables to an abfss://data@... storage path using your config pattern above. It successfully writes the tables, but when I view the tables in hive_metastore in Data Explorer, I get the fs.configuration error. If I manually run the OAuth routine to connect to that account, it connects, but I still can't view the data in Data Explorer. However, I can create an external table in Unity Catalog just fine (as it uses the Unity Catalog storage credentials).

guostong
New Contributor III

I am trying to load from ADLS with DLT. After adding the Spark configurations, I got the error below:

org.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = 818323fc-80d5-4833-9f46-7d1afc9c5bf7, runId = 722e9aac-0fdd-4206-9d49-683bb151f0bf] terminated with exception: The container in the file event `{"backfill":{"bucket":"root@dbstoragelhdp7mflfxe2y","key":"5810201264315799/Data/Temp/test_account.csv","size":1801,"eventTime":1682522202000,"newerThan$default$2":false}}` is different from expected by the source: `my_container@my_storageaccount`.

Any suggestions?

Thanks,
