Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Use Unity External Location with full paths in delta_log

migq2
New Contributor III

I have an external Delta table in Unity Catalog (let's call it mycatalog.myschema.mytable) that consists only of a `_delta_log` directory that I create semi-manually, containing the JSON files that define the table.

The JSON files point to parquet files that are not in the same directory as the `_delta_log` but in a different one (possibly even in a different Azure storage account; I am on Azure Databricks).

As an example, the JSON could look like this: 

{
    "add": {
        "dataChange": true,
        "modificationTime": 1710850923000,
        "partitionValues": {},
        "path": "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/somefile.snappy.parquet",
        "size": 12345,
        "stats": "{\"numRecords\":123}",
        "tags": {
            "INSERTION_TIME": "1710850923000000",
            "MAX_INSERTION_TIME": "1710850923000000",
            "MIN_INSERTION_TIME": "1710850923000000",
            "OPTIMIZE_TARGET_SIZE": "268435456"
        }
    }
}
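
To make the setup clearer, this is roughly how I assemble the first commit file. It is a simplified sketch: the schema, paths, and storage account names below are placeholders, and the real commits contain many more add actions.

import json, time, uuid

# Placeholder schema; the real tables have more columns.
schema_string = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "value", "type": "string", "nullable": True, "metadata": {}},
    ],
})

now_ms = int(time.time() * 1000)

# One action per line, following the Delta transaction protocol:
# a protocol action, a metaData action, and one add action per data file.
actions = [
    {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
    {"metaData": {
        "id": str(uuid.uuid4()),
        "format": {"provider": "parquet", "options": {}},
        "schemaString": schema_string,
        "partitionColumns": [],
        "configuration": {},
        "createdTime": now_ms,
    }},
    {"add": {
        "path": "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/somefile.snappy.parquet",
        "partitionValues": {},
        "size": 12345,
        "modificationTime": now_ms,
        "dataChange": True,
        "stats": json.dumps({"numRecords": 123}),
    }},
]

# The table root (a placeholder path) contains only the _delta_log directory;
# the data files live elsewhere, as described above.
commit_path = "abfss://tables@myotherstorage.dfs.core.windows.net/mytable/_delta_log/00000000000000000000.json"
dbutils.fs.put(commit_path, "\n".join(json.dumps(a) for a in actions), overwrite=True)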

When I try to read my Delta table using spark.sql("SELECT * FROM mycatalog.myschema.mytable") I get the following error:

RuntimeException: Couldn't initialize file system for path abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/somefile.snappy.parquet

which suggests Databricks is not trying to access that file through Unity Catalog external locations, but through the storage account key instead.

The path is declared in an external location and I can access it normally with UC credentials using
spark.read.load("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/", format="delta")

 
Is there a way to use UC external locations with a Delta table that uses absolute paths in the _delta_log? For security reasons I don't want to add the storage account key to the cluster via the spark.conf setting "fs.azure.account.key.mystorageaccount.dfs.core.windows.net".
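
For reference, the account-key workaround I would rather not use looks roughly like this (the key value is a placeholder):

# Session-level account-key authentication for the second storage account,
# which I want to avoid for security reasons (key value is a placeholder).
spark.conf.set(
    "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
    "<storage-account-key>",
)

# With the key set, the absolute paths in the _delta_log should resolve
# through the ABFS driver instead of failing on file system initialization.
spark.sql("SELECT * FROM mycatalog.myschema.mytable").show()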

 

4 REPLIES

migq2
New Contributor III

Thanks for your reply, Kaniz.

I understand your points, but I cannot use relative paths in my _delta_log because the files I need for my Delta table are not all under the same path (they might not even be in the same storage account).

Copying them is not an option either, because I am doing this at scale for many tables and many files.

-werners-
Esteemed Contributor III

Besides what has already been mentioned, it is best to let the Delta writer handle the location of the _delta_log and the parquet files; they belong together.
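
For example, the usual pattern keeps the data files and the _delta_log under a single table root, so the log only ever contains relative paths (the path below is just illustrative):

# Let the Delta writer place the parquet files and the _delta_log together.
df = spark.range(10)  # any DataFrame
(df.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/tables/mytable"))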

migq2
New Contributor III

Thank you. However, in my specific case the parquet files are not written by Spark or Databricks but by another external tool.

Also, some files are shared by multiple tables, or a table can have files in different storage accounts. 

This makes it infeasible to keep them all in a single table location, the way a normal Spark writer would lay them out.

-werners-
Esteemed Contributor III

I suggest you look at something other than UC for such cases. I also wonder whether Delta Lake is the right format.
