Trouble accessing `_metadata` column using cloudFiles in Delta Live Tables

tej1 · 05-16-2022

We are building a Delta Live Tables pipeline that ingests CSV files from AWS S3 using cloudFiles.

We need access to the modification timestamp of each ingested file.

As documented here, we tried selecting the `_metadata` column in a task of the Delta Live Tables pipeline, without success. Are we doing something wrong?

The code snippet is below:

@dlt.table(
    name = "bronze",
    comment = f"New {SCHEMA} data incrementally ingested from S3",
    table_properties = {
        "quality": "bronze"
    }
)
def bronze_job():
    return spark \
            .readStream \
            .format("cloudFiles") \
            .option("cloudFiles.useNotifications", "true") \
            .option("cloudFiles.format", "csv") \
            .option("cloudFiles.region", "eu-west-1") \
            .option("delimiter", ",") \
            .option("escape", "\"") \
            .option("header", "false") \
            .option("encoding", "UTF-8") \
            .schema(cdc_schema) \
            .load("/mnt/%s/cdc/%s" % (RAW_MOUNT_NAME, SCHEMA)) \
            .select("*", "_metadata")
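For context, what we ultimately want from that final `select` is roughly the following. This is only a sketch: the helper name `metadata_select_exprs` is ours, and the field names (`file_path`, `file_name`, `file_size`, `file_modification_time`) are what we understand the documentation to list under `_metadata`, so please treat them as assumptions rather than something we have verified inside Delta Live Tables.

```python
# Field names we understand the `_metadata` struct to expose (an assumption
# based on the documentation, not verified inside Delta Live Tables).
KNOWN_METADATA_FIELDS = (
    "file_path",
    "file_name",
    "file_size",
    "file_modification_time",
)


def metadata_select_exprs(fields=("file_modification_time",)):
    """Build selectExpr() arguments that keep every data column and
    flatten the requested `_metadata` fields into top-level columns."""
    unknown = [f for f in fields if f not in KNOWN_METADATA_FIELDS]
    if unknown:
        raise ValueError(f"Unknown _metadata field(s): {unknown}")
    # "*" keeps the data columns; each metadata field gets its own alias.
    return ["*"] + [f"_metadata.{f} AS {f}" for f in fields]
```

In the pipeline, the idea would be to end the chain with `.selectExpr(*metadata_select_exprs())` instead of `.select("*", "_metadata")`, assuming the runtime exposes `_metadata` at all, which is the part we cannot get to work.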

Thanks.

Tejas
