11-15-2024 07:26 AM
Great blog post: https://community.databricks.com/t5/technical-blog/integrating-apache-spark-with-databricks-unity-ca...
I have attempted to reproduce this with Azure Databricks, and ADLS gen2 as the storage backend.
Although I'm able to interact with Unity Catalog (a successful "use schema" followed by "select current_schema()" and so on), when I try to append rows to a newly created managed table (as in the example above), I get the error below.
It looks like the temporary credential supplied by UC is failing. Any ideas what could be wrong here?
java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, PUT, https://storageaccout-xyz.dfs.core.windows.net/some/dir/__unitycatalog/..."
Any ideas what I'm doing wrong? cc @dkushari
P.S. it looks like the TM in the title of your blog post is preventing anyone from commenting there.
P.P.S. this forum software is extremely painful. It won't accept the post without a label, but it makes it incredibly difficult to select any labels.
11-15-2024 08:14 AM
Hi @charl-p-botha - Thanks for your post. Please see if you have all the right UC permissions to modify the table. Can you run the same command from a DB workspace and see if it passes? Please make sure all of these are satisfied. I will check on the P.S and P.P.S with the team.
11-15-2024 08:40 AM
Hi @charl-p-botha - Can you please ensure that you are logged into the Databricks community portal to put the comments for the blog?
11-15-2024 12:30 PM
I was definitely logged in. The error message I got each time as I tried to leave a comment was:
"
Correct the highlighted errors and try again.
The message subject contains <TM symbol>, which is not permitted in this community. Please remove this content before sending your post."
(It gave me the error again now here, so I replaced the actual symbol with <TM symbol>. The title of your blog contains the symbol, so all comments will get blocked in the same way.)
11-15-2024 12:28 PM
Hi there @dkushari thank you very much for getting back to me!
I have just confirmed that I am able to run exactly the same insert command in a Databricks notebook. I am using a PAT for that same user account in my Apache Spark experiments.
Is there anything else I can try?
11-15-2024 12:38 PM - edited 11-15-2024 12:45 PM
Aaah, the link you shared has the following:
"
... and I was trying to append to a managed table.
That is a plausible explanation for the error I saw. However, in your demo, you were able to write to a managed table from the terminal: "You can insert data into the managed table from the local terminal as well as the Databricks workspace. First, insert and select some data from the local terminal:"
-- perhaps a good idea to add a note to the post that that is not yet possible in the current public preview?
In my case, I will then have to try the external table option, as for my use case I wish to be able to write to databricks unity catalog from my local spark.
11-18-2024 02:26 AM
@dkushari I hope that you can help me again.
I'm now trying the external table path, but Databricks UC refuses to give me the necessary temporary credentials for the storage location. Please see https://github.com/unitycatalog/unitycatalog/issues/560#issuecomment-2482597445 where I've added my comment to an existing issue dealing with exactly the same error message:
io.unitycatalog.client.ApiException: generateTemporaryPathCredentials call failed with: 401 - {"error_code":"UNAUTHENTICATED","message":"Request to generate access credential for path 'abfss://containerXYZ@storageAccXYZ.dfs.core.windows.net/somedir' from outside of Databricks Unity Catalog enabled compute environment is denied for security. Please contact Databricks support for integrations with Unity Catalog.","details":[{"@type":"type.googleapis.com/google.rpc.ErrorInfo","reason":"UNITY_CATALOG_EXTERNAL_GENERATE_PATH_CREDENTIALS_DENIED","domain":"unity-catalog.databricks.com","metadata":{"path":"abfss://containerXYZ@storageAccXYZ.dfs.core.windows.net/somedir"}},{"@type":"type.googleapis.com/google.rpc.RequestInfo","request_id":"e4be9d83-9e3e-47d9-bc04-1ad3f4c89ec5","serving_data":""}]}
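For anyone hitting the same response, here is a minimal sketch of pulling the machine-readable reason out of the JSON body that follows "401 - " in the exception text. It uses only the Python standard library. The helper name summarize_uc_error is hypothetical, the "message" field is shortened to a placeholder, and the "details" list is trimmed to the ErrorInfo entry; everything else is copied from the error above.

```python
import json

# JSON body copied from the ApiException above. The long "message" text is
# replaced by a placeholder and only the ErrorInfo detail entry is kept.
raw_body = (
    '{"error_code":"UNAUTHENTICATED","message":"(elided)",'
    '"details":[{"@type":"type.googleapis.com/google.rpc.ErrorInfo",'
    '"reason":"UNITY_CATALOG_EXTERNAL_GENERATE_PATH_CREDENTIALS_DENIED",'
    '"domain":"unity-catalog.databricks.com",'
    '"metadata":{"path":"abfss://containerXYZ@storageAccXYZ.dfs.core.windows.net/somedir"}}]}'
)


def summarize_uc_error(body: str) -> dict:
    """Hypothetical helper: return the error code, reason and path from the body."""
    payload = json.loads(body)
    # Pick the first detail entry whose @type ends with "ErrorInfo", if any.
    info = next(
        (d for d in payload.get("details", []) if d.get("@type", "").endswith("ErrorInfo")),
        {},
    )
    return {
        "error_code": payload.get("error_code"),
        "reason": info.get("reason"),
        "path": info.get("metadata", {}).get("path"),
    }


summary = summarize_uc_error(raw_body)
print(summary["error_code"], summary["reason"])
```

The "reason" value is the most specific part of the response, so it is probably the best string to search for in the issue tracker or to quote to support.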
11-18-2024 08:43 AM
Hi @charl-p-botha - Thanks for your post. Yes, you would need the SAFE flag to be enabled for the workspace to achieve this. Until then, can you please create the external table from the Databricks workspace and modify it from the external engine? You can work with your Databricks point of contact to get help on enabling the SAFE flag for the workspace.
11-20-2024 02:04 AM - edited 11-20-2024 02:06 AM
Dear @dkushari
From an external Spark system, I can now read from a Kafka topic and write into a UC-managed external table (underlying storage is Azure) that I pre-created in a Databricks notebook.
However, when I try to do exactly the same with readStream / writeStream, I get the error "Failure to initialize configuration for storageAccXYZ.dfs.core.windows.net key ="null": Invalid configuration value detected for fs.azure.account.key"
Can you confirm that it is expected for writeStream to fail under External Data Access? Will this eventually work?
In other words, this currently works:
df_kafka = spark.read.format("kafka").options(**KAFKA_OPTIONS).load()
# DELTA needs to be all caps, else this fails as well!
df_kafka.write.format("DELTA").mode("append").saveAsTable("catalog.schema.table")
but this fails:
df_kafka = spark.readStream.format("kafka").options(**KAFKA_OPTIONS).load()
df_kafka.writeStream.outputMode("append").trigger(availableNow=True).format("DELTA").toTable(bronze_table)
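One possible experiment, not something confirmed in this thread, is to route each micro-batch through the batch writer that is known to work by using foreachBatch. The sketch below is untested against External Data Access and may well hit the same fs.azure.account.key error. The helper names append_batch and start_stream, the checkpoint directory, and the table name are all hypothetical. The functions only call methods on the DataFrame objects passed in.

```python
def append_batch(batch_df, batch_id, table_name):
    """Write one micro-batch with the same batch call that works above."""
    # batch_id is supplied by Spark for every micro-batch; it is unused here.
    batch_df.write.format("DELTA").mode("append").saveAsTable(table_name)


def start_stream(stream_df, table_name, checkpoint_dir):
    """Start a streaming query that hands each micro-batch to append_batch.

    Assumption: checkpoint_dir points at storage the local Spark session
    can already write to without UC-vended credentials.
    """
    return (
        stream_df.writeStream
        .option("checkpointLocation", checkpoint_dir)
        .trigger(availableNow=True)
        .foreachBatch(lambda df, batch_id: append_batch(df, batch_id, table_name))
        .start()
    )
```

With the streaming df_kafka from the failing example, this would be called as start_stream(df_kafka, bronze_table, "/tmp/bronze_checkpoint"), followed by awaitTermination() on the returned query. If it fails in the same way, that would suggest the problem is not specific to toTable.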
11-20-2024 05:28 AM
Hi @charl-p-botha - Thanks for your post. There could be an error w.r.t readStream. Check this one. It was also mentioned in the OSS UC slack channel that the same issue appears for Databricks managed UC. I have not yet tried the streaming.
Can you please confirm that the issue is with writeStream alone and readStream works?
11-20-2024 06:03 AM
Thanks @dkushari
I looked at the github issue you posted, but it has to do specifically with DELTA_UNSUPPORTED_SCHEMA_DURING_READ when streaming *from* a delta table.
The specific error I'm seeing is a key error for the Azure storage account hosting the destination external table. In my example, only bronze_table (the writeStream destination) has its storage on that specific storage account. In other words, it looks like writeStream + azure key (which should be vended by the UC temp credential logic) breaks, whilst write + azure key (same table, same everything else) works 100%.
I guess it could be the same issue as the one you posted, but it's raising a horribly confusing error message.
Do you have any other tips, or bug reports I could take a look at?