06-15-2023 07:16 PM
You can add a column stamped with the date the ingestion ran for the newly added data by using the selectExpr() function in Auto Loader. It'd look something like this:
# No Python import is needed: current_timestamp() below is the SQL function,
# evaluated inside the selectExpr() string.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # The schema location directory keeps track of your data schema over time
    .option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
    .load("<source-data-with-nested-json>")
    .selectExpr(
        "*",
        "current_timestamp() as `Date_Pulled`",
    )
)
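Note that the snippet above only defines the streaming source; nothing runs until you attach a sink and start the query. A minimal sketch of completing the pipeline, assuming a Databricks environment where `spark` is predefined (the table name `bronze_events` is a placeholder, and the checkpoint path placeholder is reused from above):

```python
# Write the Auto Loader stream (the `df` defined above) to a Delta table.
# availableNow processes all files currently present, then stops the query.
(
    df.writeStream
    .option("checkpointLocation", "<path-to-checkpoint>")
    .trigger(availableNow=True)
    .toTable("bronze_events")  # placeholder target table name
)
```

Because `Date_Pulled` is computed when each micro-batch is processed, rows ingested on different days will carry different timestamps, which is typically the desired behavior for tracking when data landed.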