05-16-2022 10:07 AM
We are building a Delta Live Tables pipeline that ingests CSV files from AWS S3 using Auto Loader (`cloudFiles`), and we need access to each file's modification timestamp.
As documented, we tried selecting the `_metadata` column in a Delta Live Tables task, without success. Are we doing something wrong?
The code snippet is below:
@dlt.table(
    name="bronze",
    comment=f"New {SCHEMA} data incrementally ingested from S3",
    table_properties={"quality": "bronze"},
)
def bronze_job():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.useNotifications", "true")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.region", "eu-west-1")
        .option("delimiter", ",")
        .option("escape", "\"")
        .option("header", "false")
        .option("encoding", "UTF-8")
        .schema(cdc_schema)
        .load("/mnt/%s/cdc/%s" % (RAW_MOUNT_NAME, SCHEMA))
        .select("*", "_metadata")
    )
Thanks.
Tejas
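For reference, a minimal sketch of what we are trying to achieve (assuming the same `cdc_schema`, `RAW_MOUNT_NAME`, and `SCHEMA` variables as the snippet above): on runtimes where the `_metadata` column is supported, its individual subfields such as `file_modification_time` and `file_path` can be selected directly instead of the whole struct.

```python
# Sketch only, requires a Databricks runtime that supports the _metadata column.
# Assumes cdc_schema, RAW_MOUNT_NAME, and SCHEMA are defined as in the snippet above.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .schema(cdc_schema)
    .load("/mnt/%s/cdc/%s" % (RAW_MOUNT_NAME, SCHEMA))
    # Select only the metadata subfields we actually need.
    .select("*", "_metadata.file_modification_time", "_metadata.file_path")
)
```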
05-17-2022 02:35 AM
Are you using Databricks Runtime 10.5?
05-17-2022 05:42 AM
Yes; on a standalone cluster (i.e., any cluster outside of the DLT pipeline) this feature works on DBR 10.5.
I found the issue: we cannot choose the runtime in DLT pipeline settings (there is no way to set `spark_version`). 😫
05-18-2022 12:45 AM
Hi @Tejas Sherkar, thank you for sharing the solution with the community. I'm glad you were able to find the solution to your problem. I am marking your answer as the best answer.
07-02-2022 11:42 AM
I'm having the same problem. Does this answer mean that there is no way to get file metadata using Delta Live Tables?
07-03-2022 11:13 AM
Currently, DLT runs on runtime 10.3. Once it moves to 10.5 or higher, this should be possible.
08-03-2022 05:54 AM
Update:
We were able to test the `_metadata` column feature in DLT's "preview" channel (which is DBR 11.0). Databricks doesn't recommend the preview channel for production workloads, but nevertheless, we're glad to be using this feature in DLT.
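For anyone landing here later: the channel is selected in the pipeline's settings. A minimal sketch of the relevant fragment of the DLT pipeline settings JSON (the pipeline name here is illustrative):

```json
{
  "name": "my-cdc-pipeline",
  "channel": "PREVIEW"
}
```

Setting `"channel": "PREVIEW"` runs the pipeline on the preview runtime instead of the default `"CURRENT"` channel.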