Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Trouble accessing `_metadata` column using cloudFiles in Delta Live Tables

tej1
New Contributor III

We are building a Delta Live Tables pipeline that ingests CSV files from AWS S3 using cloudFiles.

We need to access the modification timestamp of each ingested file.

As documented here, we tried selecting the `_metadata` column in a table definition in the Delta Live Tables pipeline, without success. Are we doing something wrong?

The code snippet is below:

@dlt.table(
    name = "bronze",
    comment = f"New {SCHEMA} data incrementally ingested from S3",
    table_properties = {
        "quality": "bronze"
    }
)
def bronze_job():
    return spark \
            .readStream \
            .format("cloudFiles") \
            .option("cloudFiles.useNotifications", "true") \
            .option("cloudFiles.format", "csv") \
            .option("cloudFiles.region", "eu-west-1") \
            .option("delimiter", ",") \
            .option("escape", "\"") \
            .option("header", "false") \
            .option("encoding", "UTF-8") \
            .schema(cdc_schema) \
            .load("/mnt/%s/cdc/%s" % (RAW_MOUNT_NAME, SCHEMA)) \
            .select("*", "_metadata")

Thanks.

Tejas

1 ACCEPTED SOLUTION

Accepted Solutions

tej1
New Contributor III

Yes. On a standalone cluster (that is, any cluster outside of the DLT pipeline), this feature works with DBR 10.5.

I found the issue: we cannot choose the runtime in the DLT pipeline settings (`spark_version` cannot be set). 😫


5 REPLIES

Hubert-Dudek
Esteemed Contributor III

Are you using Databricks Runtime 10.5?

tej1
New Contributor III

Yes. On a standalone cluster (that is, any cluster outside of the DLT pipeline), this feature works with DBR 10.5.

I found the issue: we cannot choose the runtime in the DLT pipeline settings (`spark_version` cannot be set). 😫

colt
New Contributor III

I'm having the same problem. Does this answer mean that there is no way to get file metadata using Delta Live Tables?

Hubert-Dudek
Esteemed Contributor III

Currently, DLT is running on runtime 10.3. Once it is 10.5 or higher, it should be possible.
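
On a runtime that exposes the hidden `_metadata` column, the file modification timestamp asked about in the question could be pulled out as its own column rather than selecting the whole struct. A minimal sketch, assuming the `file_modification_time` field of `_metadata`; the helper name `with_file_mtime` and the output column name `source_file_mtime` are hypothetical and not from the original post.

```python
def with_file_mtime(df, alias="source_file_mtime"):
    """Return `df` with one extra column holding each row's source-file
    modification time.

    `df` is assumed to be a Spark DataFrame read from a file-based source
    (for example cloudFiles) on a runtime that supports `_metadata`.
    The alias is a hypothetical name chosen for illustration.
    """
    # `_metadata` is hidden, so it has to be referenced explicitly;
    # "*" alone does not include it.
    return df.selectExpr("*", f"_metadata.file_modification_time AS {alias}")
```

In the `bronze_job` function from the question, this would stand in for the final `.select("*", "_metadata")`, by passing the result of `.load(...)` to `with_file_mtime`.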

tej1
New Contributor III

Update:

We were able to test the `_metadata` column feature in DLT "preview" mode (which runs DBR 11.0). Databricks doesn't recommend running production workloads in "preview" mode, but we are nevertheless glad to be able to use this feature in DLT.
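
For readers who want to try the same thing, the "preview" mode mentioned above is selected in the pipeline settings. A rough sketch, assuming the `channel` field of the DLT pipeline settings JSON and the value `"PREVIEW"` (check the exact spelling against your workspace's settings editor); the helper `use_preview_channel` and the sample settings are hypothetical.

```python
import json


def use_preview_channel(settings: dict) -> dict:
    """Return a copy of DLT pipeline settings with the preview channel selected.

    The input dict is left unchanged. The "channel" key and "PREVIEW" value
    are assumptions about the settings schema, not taken from the thread.
    """
    updated = dict(settings)
    updated["channel"] = "PREVIEW"
    return updated


# Hypothetical, trimmed-down settings; a real pipeline carries more fields.
settings = {"name": "bronze-ingest", "continuous": False}
print(json.dumps(use_preview_channel(settings), indent=2))
```

As noted in the update, this is suited to testing rather than production workloads.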
