05-16-2022 10:07 AM
We are building a Delta Live Tables pipeline that ingests CSV files from AWS S3 using cloudFiles.
We need access to the modification timestamp of each ingested file.
As documented here, we tried selecting the `_metadata` column in a Delta Live Tables pipeline task, without success. Are we doing something wrong?
The code snippet is below:
import dlt

@dlt.table(
    name = "bronze",
    comment = f"New {SCHEMA} data incrementally ingested from S3",
    table_properties = {
        "quality": "bronze"
    }
)
def bronze_job():
    return spark \
        .readStream \
        .format("cloudFiles") \
        .option("cloudFiles.useNotifications", "true") \
        .option("cloudFiles.format", "csv") \
        .option("cloudFiles.region", "eu-west-1") \
        .option("delimiter", ",") \
        .option("escape", "\"") \
        .option("header", "false") \
        .option("encoding", "UTF-8") \
        .schema(cdc_schema) \
        .load("/mnt/%s/cdc/%s" % (RAW_MOUNT_NAME, SCHEMA)) \
        .select("*", "_metadata")
Thanks.
Tejas
05-17-2022 02:35 AM
Are you using Databricks Runtime 10.5?
05-17-2022 05:42 AM
Yes. On a standalone cluster (that is, any cluster outside of the DLT pipeline) this feature works with DBR 10.5.
I found the issue: we cannot choose the runtime in the DLT pipeline settings (there is no way to set `spark_version`).
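For anyone landing here later: on a runtime where `_metadata` is available, pulling out just the modification timestamp might look like the sketch below. The helper name `add_file_modification_time` is made up for illustration, and it assumes the DataFrame was read from files so that the hidden `_metadata` column exists.

```python
def add_file_modification_time(df):
    """Return df with the source file's modification time as an extra column.

    `_metadata` is a hidden column, so "*" does not include it and the
    field has to be selected explicitly. The new column is named
    `file_modification_time`.
    """
    return df.select("*", "_metadata.file_modification_time")


# Usage on a cluster (not executed here; `spark`, `cdc_schema` and the path
# come from the original post and are assumed to exist):
# df = spark.read.format("csv").schema(cdc_schema).load("/mnt/...")
# add_file_modification_time(df).show()
```

Selecting the single field instead of the whole `_metadata` struct keeps the bronze table schema flat, which is usually easier to work with downstream.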
05-18-2022 12:45 AM
Hi @Tejas Sherkar, thank you for sharing the solution with the community. I'm glad that you were able to find the cause of your problem. I am marking your answer as the best answer.
07-02-2022 11:42 AM
I'm having the same problem. Does this answer mean that there is no way to get file metadata using Delta Live Tables?
07-03-2022 11:13 AM
Currently, DLT is running on runtime 10.3. Once it is 10.5 or higher, it should be possible.
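If you want a pipeline or notebook to check this for itself, a rough sketch of the version comparison is below. It assumes the runtime version is exposed through the `DATABRICKS_RUNTIME_VERSION` environment variable and that the string starts with `major.minor`; both are assumptions you should verify in your environment. The function names are hypothetical.

```python
import os
import re

# Minimum runtime mentioned in this thread for the `_metadata` column.
MIN_METADATA_DBR = (10, 5)


def parse_dbr_version(version):
    """Extract (major, minor) from strings such as '10.5' or '11.0.x-scala2.12'."""
    match = re.match(r"\s*(\d+)\.(\d+)", version or "")
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supports_file_metadata(version):
    """True if the given runtime version is 10.5 or higher."""
    parsed = parse_dbr_version(version)
    return parsed is not None and parsed >= MIN_METADATA_DBR


# Assumption: this variable is set on Databricks clusters; it is absent elsewhere.
current = os.environ.get("DATABRICKS_RUNTIME_VERSION")
print("runtime:", current, "supports _metadata:", supports_file_metadata(current))
```

Comparing `(major, minor)` tuples avoids the string-comparison trap where "10.10" would sort before "10.5".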
08-03-2022 05:54 AM
Update:
We were able to test the `_metadata` column feature in DLT "preview" mode (which runs DBR 11.0). Databricks doesn't recommend running production workloads in "preview" mode, but we are nevertheless glad to be able to use this feature in DLT.
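In case it helps, here is a small sketch of switching a pipeline's settings to the preview channel. It assumes the pipeline settings JSON has a `channel` field that accepts `"PREVIEW"`; the pipeline name and the helper `with_preview_channel` are hypothetical, so check the field against your own pipeline's JSON settings before relying on it.

```python
import json


def with_preview_channel(settings):
    """Return a copy of the pipeline settings with the channel set to preview.

    Assumption: the DLT pipeline settings JSON uses a top-level "channel" key.
    The input dict is left unmodified.
    """
    updated = dict(settings)
    updated["channel"] = "PREVIEW"
    return updated


# Hypothetical settings, for illustration only.
settings = {"name": "bronze_pipeline", "continuous": False}
print(json.dumps(with_preview_channel(settings), indent=2))
```

The printed JSON can then be pasted into the pipeline's settings editor, bearing in mind the caveat above about production workloads.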