Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Is it possible to use other databases within Delta Live Tables (DLT)?

477061
Contributor

I have set up a DLT pipeline with "testing" set as the target database. I need to join data that exists in a "keys" table in my "beta" database, but I get an AccessDeniedException, despite having full access to both databases via a normal notebook.

A snippet demonstrating the issue is below (Note that I don't appear to have the code snippet button available in Chrome):

The following would be data from the keys table:

```
key_df = spark.createDataFrame([[0, "State0"]], ['key', 'name'])
```

Then the DLT:

```
import dlt

@dlt.table()
def dlt_test_data():
    df = spark.createDataFrame([["40873", 0]], ['version', 'key'])
    return df

@dlt.table()
def dlt_test_silver():
    data_df = dlt.read("dlt_test_data")
    # Reading the keys table from the "beta" database is what raises the AccessDeniedException
    key_df = spark.table("beta.keys")
    joined_df = data_df.join(key_df, "key")
    return joined_df
```

I also tried importing the data in a view and reading from that in dlt_test_silver, but this also failed.
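For reference, a minimal sketch of that view-based attempt (the view name "beta_keys" is just illustrative), which failed with the same error:

```
@dlt.view()
def beta_keys():
    # Wrap the external table in a DLT view
    return spark.table("beta.keys")

@dlt.table()
def dlt_test_silver():
    data_df = dlt.read("dlt_test_data")
    key_df = dlt.read("beta_keys")
    return data_df.join(key_df, "key")
```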

Is it possible to use other databases within a DLT pipeline? If so, how?

1 ACCEPTED SOLUTION


477061
Contributor

As an update to this issue: I was running the DLT pipeline on a personal cluster that had an instance profile defined (as per Databricks best practices). As a result, the pipeline did not have permission to access other S3 resources (e.g. other databases). The solution was to explicitly add the AWS permissions to the pipeline. To do this, edit the JSON settings of the pipeline and add the "aws_attributes" key to the "clusters" definition (discussed here). The value for the "aws_attributes" key can be found by going to the cluster definition menu, clicking Edit, and viewing the JSON of the cluster definition.

Apparently this issue would not have occurred had I been using a job cluster.
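For reference, a rough sketch of what the edited pipeline JSON settings might look like (only the relevant keys are shown; the instance profile ARN is a placeholder copied from your cluster definition):

```
{
  "clusters": [
    {
      "label": "default",
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<profile-name>"
      }
    }
  ]
}
```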


11 REPLIES

-werners-
Esteemed Contributor III

What exactly do you mean by 'database'? Actual RDBMS systems, or databases in the Hive metastore (or Unity Catalog)?

477061
Contributor

Apologies, I'm new to Databricks. I mean databases in the Hive metastore.

-werners-
Esteemed Contributor III

No problem.

It should work; Hive databases are only a semantic view on top of the actual files residing in your data lake.
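For example, you can see the storage location that such a database entry points to with something like this (using the table name from the question):

```
# Shows table metadata, including the Location of the underlying files in the data lake
spark.sql("DESCRIBE TABLE EXTENDED beta.keys").show(truncate=False)
```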

Have you enabled table access control?

477061
Contributor

Table access control is currently disabled, and the admin is refusing to enable it. Does it need to be enabled for the DLT pipeline to use other Hive metastore databases?

-werners-
Esteemed Contributor III

No, but it could be the issue if it were enabled.

What happens if you read the Parquet (or Delta Lake, CSV, ...) files directly, without going through the Hive metastore?
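Something along these lines, with a placeholder path standing in for the table's actual S3 location:

```
# Read the underlying files by path, bypassing the Hive metastore entirely
key_df = spark.read.format("delta").load("s3://<bucket>/<path-to-keys-table>/")

# or, if the table is stored as plain parquet
key_df = spark.read.parquet("s3://<bucket>/<path-to-keys-table>/")
```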

477061
Contributor

I have just tried to read from the parquet file directly, and it fails with the same error.

-werners-
Esteemed Contributor III

I guess it's one of those things that still have to be sorted out on DLT.

Anonymous
Not applicable

Right now, yes - DLT only supports one target database. So if you need to push into different databases, you may need two DLT pipelines.

https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-quickstart.html#requiremen...

Hope this helps. Cheers!

477061
Contributor

Thanks, yes, I knew about the limitation on the target database. There does not appear to be any documentation on a limitation on source databases (for clarity, I want to pull from 2 source databases but push to 1 target database), but I assume it comes to the same thing.

Thanks, but no, I have not yet received an answer. Other customers have tried to help, but there is still no resolution.

