Query separate data loads from python spark.readStream

jb1z
Contributor

I am using python spark.readStream in a Delta Live Tables pipeline to read json data files from a S3 folder path. Each load is a daily snapshot of a very similar set of products showing changes in price and inventory. How do i distinguish and query each daily load of json products? 

 

import dlt
from datetime import datetime
folder_date = datetime.today().strftime('%Y-%m-%d')
@dlt.table(table_properties={'quality': 'bronze', 'delta.columnMapping.mode': 'name', 'delta.minReaderVersion': '2', 'delta.minWriterVersion': '5'})
def items_inventory_price():
  return (
     spark.readStream.format('cloudFiles')
     .option('cloudFiles.format', 'json')
     .option('delta.columnMapping.mode', 'name')
     .load(f's3://bucket/inventory/Item/{folder_date}')
    )

 

I was looking at `DESCRIBE HISTORY items_inventory_price` to use versions but these are not supported in Streaming Tables, the message is suggesting to switch to SQL warehouse. 

If I was able to add a date column to each data load I would be able to separate each load, or there may be metadata that i can use?