Databricks

tej1 · ‎05-16-2022

We are building a Delta Live Tables pipeline that ingests CSV files from AWS S3 using cloudFiles.

We need access to each file's modification timestamp.

As documented here, we tried selecting the `_metadata` column in a task in the Delta Live Tables pipeline, without success. Are we doing something wrong?

The code snippet is below:

@dlt.table(
    name = "bronze",
    comment = f"New {SCHEMA} data incrementally ingested from S3",
    table_properties = {
        "quality": "bronze"
    }
)
def bronze_job():
    return spark \
            .readStream \
            .format("cloudFiles") \
            .option("cloudFiles.useNotifications", "true") \
            .option("cloudFiles.format", "csv") \
            .option("cloudFiles.region", "eu-west-1") \
            .option("delimiter", ",") \
            .option("escape", "\"") \
            .option("header", "false") \
            .option("encoding", "UTF-8") \
            .schema(cdc_schema) \
            .load("/mnt/%s/cdc/%s" % (RAW_MOUNT_NAME, SCHEMA)) \
            .select("*", "_metadata")

Thanks.

Tejas

Hubert-Dudek · ‎05-17-2022

Are you using Databricks Runtime 10.5?

tej1 · ‎05-17-2022

Yes. On a standalone cluster (any cluster outside of the DLT pipeline), this feature works with DBR 10.5.

I found the issue: we cannot choose the runtime in DLT pipeline settings (there is no way to set `spark_version`). 😫

Kaniz · ‎05-18-2022

Hi @Tejas Sherkar, thank you for sharing the solution with the community. I'm glad you were able to find the cause of your problem. I am marking your answer as the best answer.

colt · ‎07-02-2022

I'm having the same problem. Does this answer mean that there is no way to get file metadata using Delta Live Tables?

Hubert-Dudek · ‎07-03-2022

Currently, DLT is running on runtime 10.3. Once it is 10.5 or higher, it should be possible.
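
A minimal sketch of what the select could look like once the pipeline runtime exposes `_metadata`. The helper name `with_file_metadata` is hypothetical, the field names `file_path` and `file_modification_time` are assumed from the file metadata column documentation, and this has not been verified inside a DLT pipeline:

```python
def with_file_metadata(df):
    """Append selected source-file metadata fields to a DataFrame.

    `df` is expected to be a Spark DataFrame read from a file-based source
    on a runtime that exposes the hidden `_metadata` struct column.
    """
    return df.select(
        "*",                                  # all data columns
        "_metadata.file_path",                # assumed field: source file path
        "_metadata.file_modification_time",   # assumed field: last modified time
    )
```

In the `bronze_job` function from the original post, this would take the place of the final `.select("*", "_metadata")` by wrapping the stream returned by `.load(...)`. Selecting individual fields rather than the whole struct keeps the bronze table flat, which is a design preference and not a requirement.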

tej1 · ‎08-03-2022

Update:

We were able to test the `_metadata` column feature in DLT "preview" mode (which runs DBR 11.0). Databricks doesn't recommend running production workloads in "preview" mode, but we're nevertheless glad to be using this feature in DLT.
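
For anyone looking for where this is set: a sketch of a pipeline settings fragment, assuming that preview mode corresponds to a `channel` field in the pipeline settings JSON with the value `preview`. The pipeline name is a placeholder, and the field name and value should be checked against the current pipeline settings documentation:

```python
import json

# Hypothetical pipeline settings fragment; only "channel" matters here.
pipeline_settings = {
    "name": "bronze-ingest-example",  # placeholder name, not from this thread
    "channel": "preview",             # assumed value selecting the preview runtime
}

# Serialize to JSON, the format the pipeline settings editor works with.
settings_json = json.dumps(pipeline_settings, indent=2)
print(settings_json)
```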

Trouble accessing `_metadata` column using cloudFiles in Delta Live Tables
