03-22-2024 04:56 PM - edited 03-22-2024 05:06 PM
I have an external Delta table in Unity Catalog (let's call it mycatalog.myschema.mytable) that consists only of a `_delta_log` directory, which I create semi-manually along with the JSON files that define it.
The JSON files point to Parquet files that are not in the same directory as the `_delta_log` but in a different one (possibly even in a different Azure storage account; I am on Azure Databricks).
As an example, the JSON could look like this:
{
  "add": {
    "dataChange": true,
    "modificationTime": 1710850923000,
    "partitionValues": {},
    "path": "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/somefile.snappy.parquet",
    "size": 12345,
    "stats": "{\"numRecords\":123}",
    "tags": {
      "INSERTION_TIME": "1710850923000000",
      "MAX_INSERTION_TIME": "1710850923000000",
      "MIN_INSERTION_TIME": "1710850923000000",
      "OPTIMIZE_TARGET_SIZE": "268435456"
    }
  }
}
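For reference, here is a minimal Python sketch of how such an entry could be generated. The helper name `make_add_action` is hypothetical, and the optional `tags` are left out for brevity. It relies on the Delta commit files holding one JSON action per line, so the pretty-printed form above is only for readability.

```python
import json


def make_add_action(path, size, num_records, modification_time_ms):
    """Build one `add` action as a single compact JSON line (hypothetical helper)."""
    action = {
        "add": {
            "path": path,
            "partitionValues": {},
            "size": size,
            "modificationTime": modification_time_ms,
            "dataChange": True,
            # `stats` is itself a JSON document stored as a string.
            "stats": json.dumps({"numRecords": num_records}, separators=(",", ":")),
        }
    }
    # Compact separators keep the whole action on one line.
    return json.dumps(action, separators=(",", ":"))


line = make_add_action(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/somefile.snappy.parquet",
    12345,
    123,
    1710850923000,
)
print(line)
```

Each line produced this way would be appended to a commit file inside `_delta_log`; a real commit also needs the other actions (such as the table metadata and protocol) that this sketch does not cover.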
When I try to read my Delta table using `spark.sql("SELECT * FROM mycatalog.myschema.mytable")` I get the following error:
RuntimeException: Couldn't initialize file system for path abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/somefile.snappy.parquet
which suggests Databricks is trying to access that file with the storage account key rather than through Unity Catalog external locations.
The path is declared in an external location, and I can access it normally with UC credentials using
spark.read.load("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/", format="delta")
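To double-check which containers and storage accounts a hand-built log actually references, something like the following sketch might help. The function name `storage_locations` and the second account in the sample are hypothetical; the code only parses the commit lines with the standard library and does not call any Databricks API, so each resulting pair still has to be compared against the external locations defined in UC by hand.

```python
import json
from urllib.parse import urlparse


def storage_locations(commit_lines):
    """Return sorted (account_host, container) pairs used by absolute abfss add paths."""
    found = set()
    for raw in commit_lines:
        raw = raw.strip()
        if not raw:
            continue
        add = json.loads(raw).get("add")
        if add is None:
            continue  # not an add action
        parsed = urlparse(add["path"])
        if parsed.scheme != "abfss":
            continue  # relative paths are skipped here
        # abfss://<container>@<account>.dfs.core.windows.net/<path>
        found.add((parsed.hostname, parsed.username))
    return sorted(found)


sample = [
    '{"add":{"path":"abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/somefile.snappy.parquet"}}',
    # Hypothetical second account, for illustration only.
    '{"add":{"path":"abfss://othercontainer@otheraccount.dfs.core.windows.net/x/y.parquet"}}',
    '{"add":{"path":"part-0000.snappy.parquet"}}',
]
locations = storage_locations(sample)
print(locations)
```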
03-26-2024 05:37 AM
Hi @migq2,
One option is to use relative paths in the `_delta_log` instead of absolute paths. This way, the external table can be queried without encountering the file system initialization issue.
Alternatively, keep using `spark.read.load` with UC credentials to access the data. While this doesn't directly involve the external table, it provides a workaround.
03-26-2024 05:50 AM
Thanks for your reply Kaniz,
I understand your points, but I cannot use relative paths in my `_delta_log` because the files I need for my Delta table are not all under the same path (they might not even be in the same storage account).
Copying them is not an option either, because I am doing this at scale for many tables and many files.
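To illustrate the constraint, here is a rough Python sketch, assuming relative `add` paths are resolved against the table root (the directory that contains `_delta_log`). The function name `relative_to_table_root` and the table root used below are hypothetical. A file can only be written as a relative path when it sits under that root in the same container and account; otherwise the function returns `None`.

```python
from urllib.parse import urlparse


def relative_to_table_root(table_root, file_path):
    """Return file_path relative to table_root, or None if it lies outside the root."""
    root = urlparse(table_root)
    target = urlparse(file_path)
    # A different scheme, container, or storage account can never be relative.
    if (root.scheme, root.netloc) != (target.scheme, target.netloc):
        return None
    root_path = root.path.rstrip("/") + "/"
    if not target.path.startswith(root_path):
        return None
    return target.path[len(root_path):]


# Hypothetical table root, for illustration only.
table_root = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/tables/mytable"

inside = relative_to_table_root(table_root, table_root + "/part-0000.snappy.parquet")
outside = relative_to_table_root(
    table_root,
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/somefile.snappy.parquet",
)
print(inside, outside)
```

In my case most files would fall into the `None` branch, which is why I keep absolute paths.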
03-26-2024 05:43 AM
Besides what has already been mentioned, it is best to let the Delta writer handle the location of both the `_delta_log` and the Parquet files; they belong together.
03-26-2024 06:00 AM
Thank you. However, in my specific case the Parquet files are not written by Spark or Databricks, but by another external tool.
Also, some files are shared by multiple tables, and a table can have files in different storage accounts.
This makes it infeasible to keep them in the same location, as a normal Spark writer would.
03-26-2024 06:05 AM
I suggest you look at something other than UC for such cases. I also wonder whether Delta Lake is the right format.