Databricks Community

migq2 · ‎03-22-2024

I have an external delta table in unity catalog (let's call it mycatalog.myschema.mytable) that only consists of a `_delta_log` directory that I create semi-manually, with the corresponding JSON files that define it.

The JSON files point to parquet files that are not in the same directory as the `_delta_log`, but in a different one (can even be a different Azure storage account, I am in Azure Databricks)

As an example, the JSON could look like this:

{
    "add": {
        "dataChange": true,
        "modificationTime": 1710850923000,
        "partitionValues": {},
        "path": "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/somefile.snappy.parquet",
        "size": 12345,
        "stats": "{\"numRecords\":123}",
        "tags": {
            "INSERTION_TIME": "1710850923000000",
            "MAX_INSERTION_TIME": "1710850923000000",
            "MIN_INSERTION_TIME": "1710850923000000",
            "OPTIMIZE_TARGET_SIZE": "268435456"
        }
    }
}

When I try to read my delta table using spark.sql("SELECT * FROM mycatalog.myschema.mytable")` I get the following error:

RuntimeException: Couldn't initialize file system for path abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/somefile.snappy.parquet

which means Databricks is not trying to access that file using Unity external locations but the storage account key.

The path is declared in a external location and I can access it normally with UC credentials using
spark.read.load("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/somepath/", format="delta")

Is there a way to use UC external locations with a delta table that uses absolute paths in the _delta_log? Due to security I don't want to add the storage account key to the cluster using spark.conf "fs.azure.account.key.mystorageaccount.dfs.core.windows.net

migq2 · ‎03-26-2024

Thanks for your reply Kaniz,

I understand your points, but I cannot use relative paths in my _delta_log because the files I need for my delta table are not all in the same path (they might not even be on the same storage account).

Copying them is not an option either because I am doing this at scale for many tables and many files

-werners- · ‎03-26-2024

Besides what already has been mentioned, it is best to let the delta writer handle the location of _delta_log and the parquet files, they belong to each other.

migq2 · ‎03-26-2024

Thank you, however in my specific case the parquet files are not written by Spark or Databricks, but by another external tool.

Also, some files are shared by multiple tables, or a table can have files in different storage accounts.

This makes having them in the same location as a normal spark writer would create them not feasible