09-03-2024 03:01 AM
Hi,
I want to create an external location from Azure Databricks to a Microsoft Fabric Lakehouse, but it seems I am missing something.
What did I do: I followed the approach from source [1] below, granting access on the Fabric workspace (instead of on an ADLS Gen2 account) in the Azure Portal.
Now I want to create an external location in Azure Databricks with the OneLake path (roughly as sketched after the list of paths below), but I get an error:
Failed to access cloud storage: [AbfsRestOperationException]
The paths I tried follow these patterns:
abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/
abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/{lakehouse_name}
abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/{lakehouse_name}.Lakehouse
abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/{lakehouse_name}.Lakehouse/
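For reference, what I'm trying to run looks roughly like the sketch below (location and credential names are placeholders; the storage credential would be backed by the Access Connector):
spark.sql("""
CREATE EXTERNAL LOCATION IF NOT EXISTS onelake_lakehouse
URL 'abfss://<workspace_name>@onelake.dfs.fabric.microsoft.com/<lakehouse_name>.Lakehouse/'
WITH (STORAGE CREDENTIAL my_storage_credential)
""")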
Since I have struggled so far to find documentation for this use case (connecting from Databricks to Fabric, not the other way around), I may also be on the wrong path altogether.
Any tips on what the issue might be?
Best, Stefan
PS: The Fabric Lakehouse has an abfss:// path which I have already validated by reading data from it (within a Fabric notebook):
import pandas as pd

# Works inside a Fabric notebook: read a Lakehouse table via its OneLake path
pd.read_parquet(f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/{lakehouse_name}.Lakehouse/Tables/{table_name}")
Sources:
[1] Advancing Spark - External Tables with Unity Catalog - YouTube (I tried this approach, granting access on the Fabric workspace instead of on an ADLS Gen2 account in the Azure Portal)
09-03-2024 03:05 AM - edited 09-03-2024 03:12 AM
Hi @stefanberreiter ,
You need to grant access on the storage account used by your Microsoft Fabric instance, not on the Fabric workspace itself.
You also need to assign the following role to the Access Connector for Azure Databricks:
- Storage Blob Data Contributor
So Viewer and Contributor are not sufficient.
Use Azure managed identities in Unity Catalog to access storage - Azure Databricks | Microsoft Learn
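Once that role assignment is in place and the storage credential/external location are registered, a quick way to sanity-check access from a notebook is a simple listing; the abfss URL below is just a placeholder for the location you registered:
# Should list files if the Access Connector's managed identity has Storage Blob Data Contributor
display(dbutils.fs.ls("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>"))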
09-03-2024 03:25 AM
Hi @szymon_dybczak,
thanks for helping out.
It seems (at least from this blog post) that you can grant access directly from within a Fabric workspace, once you have enabled the OneLake tenant setting "Users can access data stored in OneLake with apps external to Fabric".
from source: "The second setting can be found a bit further down under OneLake settings. This setting allows you to use non-Fabric applications like a Python SDK, Databricks, and more to read and write to the OneLake."
Do you know what else one would need to configure (and where) to add the managed identity? Could you guide me a bit more on what you said about granting the Access Connector for Azure Databricks access to the storage account your Fabric instance uses? As far as I understand, that storage is automatically managed by OneLake ("OneLake comes automatically with every Microsoft Fabric tenant").
How to use service principal authentication to access Microsoft Fabric's OneLake (dataroots.io)
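For what it's worth, my understanding is that this tenant setting is what lets non-Fabric clients reach OneLake at all; a minimal sketch with the Azure Python SDK, assuming a service principal that has been added to the workspace (all names below are placeholders):
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder service principal credentials
credential = ClientSecretCredential(
    tenant_id="<tenant_id>",
    client_id="<client_id>",
    client_secret="<client_secret>",
)

# OneLake exposes a DFS endpoint; the Fabric workspace acts as the file system
client = DataLakeServiceClient("https://onelake.dfs.fabric.microsoft.com", credential=credential)
fs = client.get_file_system_client("<workspace_name>")
for p in fs.get_paths(path="<lakehouse_name>.Lakehouse/Files"):
    print(p.name)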
09-03-2024 03:46 AM
Hi @stefanberreiter ,
Ok, so it looks like you need to enable Azure Data Lake Storage credential passthrough to make it work. Did you do this step?
Below is the step-by-step instruction from the documentation:
Integrate OneLake with Azure Databricks - Microsoft Fabric | Microsoft Learn
You can also take a look at the video below:
Leverage OneLake with Azure Databricks (youtube.com)
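With passthrough enabled on the cluster (as in the doc above), the read itself is just a plain Spark load against the OneLake path; a sketch with placeholder names:
# Placeholders for your Fabric workspace, lakehouse and table
workspace_name = "<workspace_name>"
lakehouse_name = "<lakehouse_name>"
table_name = "<table_name>"

# Lakehouse tables are stored as Delta; the passthrough credential handles auth
df = spark.read.format("delta").load(
    f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/{lakehouse_name}.Lakehouse/Tables/{table_name}"
)
df.show(10)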
09-03-2024 04:13 AM
Hi @szymon_dybczak ,
thanks for replying. I've looked into it a bit (thanks for the sources), and now a few more questions are popping up.
It seems like credential passthrough will be deprecated, and it only works in conjunction with a cluster, while I am looking for a way to have an external table. The idea would be to point at the storage of the Lakehouse data in Fabric, rather than reading it and copying it into Databricks (which I believe is the use case in the video).
09-03-2024 03:54 AM
And if you want to use service principal authentication (assuming you already have one), then you need to add this service principal to the Fabric workspace (as in the URL you sent: How to use service principal authentication to access Microsoft Fabric's OneLake (dataroots.io)).
Then you can use service principal authentication in Databricks in the following way:
storage_account = "<storage_account>"
tenant_id = "<tenant_id>"
service_principal_id = "<service_principal_id>"
service_principal_password = "<service_principal_password>"

# Configure OAuth (client credentials) authentication for the ABFS driver
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", service_principal_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", service_principal_password)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Read with the service principal
df = spark.read.format("parquet").load(f"abfss://default@{storage_account}.dfs.core.windows.net/data/unmanaged/t_unmanag_parquet")
df.show(10)

# Write back as Delta
df.write.format("delta").mode("overwrite").save(f"abfss://default@{storage_account}.dfs.core.windows.net/data/unmanaged/fab_unmanag_delta_spn")
09-03-2024 04:20 AM
I guess I'm now looking into Lakehouse Federation for the SQL endpoint of the Fabric Lakehouse, which I think comes closest to the external table experience.
Running Federated Queries from Unity Catalog on Microsoft Fabric SQL Endpoint | by Aitor Murguzur | ...
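If I understand Lakehouse Federation correctly, the setup would look roughly like the sketch below; connection/catalog names, the SQL endpoint host, and the secret scope/keys are placeholders, and the Fabric-specific authentication details (service principal) are as described in the linked post:
# Create a connection to the Fabric Lakehouse SQL endpoint (placeholder host and secrets)
spark.sql("""
CREATE CONNECTION IF NOT EXISTS fabric_sql_endpoint TYPE sqlserver
OPTIONS (
  host '<sql-endpoint-connection-string>',
  port '1433',
  user secret('<secret_scope>', '<user_key>'),
  password secret('<secret_scope>', '<password_key>')
)
""")

# Expose the Lakehouse's SQL endpoint database as a foreign catalog in Unity Catalog
spark.sql("""
CREATE FOREIGN CATALOG IF NOT EXISTS fabric_lakehouse
USING CONNECTION fabric_sql_endpoint
OPTIONS (database '<lakehouse_name>')
""")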
09-03-2024 04:27 AM - edited 09-03-2024 04:32 AM
Yeah, that seems like a good option. Though it also uses a service principal to authenticate. I think in the future they will add the ability to use the Databricks access connector (MSI) as a valid authentication option for OneLake.