<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Query separate data loads from python spark.readStream in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/106002#M42338</link>
    <description>&lt;P&gt;To be clear, I want to add a new date column so I can query the daily loads of inventory and product. I don't want to modify an existing column.&lt;/P&gt;</description>
    <pubDate>Thu, 16 Jan 2025 23:42:57 GMT</pubDate>
    <dc:creator>jb1z</dc:creator>
    <dc:date>2025-01-16T23:42:57Z</dc:date>
    <item>
      <title>Query separate data loads from python spark.readStream</title>
      <link>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/105840#M42284</link>
      <description>&lt;P&gt;I am using python spark.readStream in a Delta Live Tables pipeline to read json data files from an S3 folder path. Each load is a daily snapshot of a very similar set of products showing changes in price and inventory. How do I distinguish and query each daily load of json products?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dlt
from datetime import datetime
folder_date = datetime.today().strftime('%Y-%m-%d')
@dlt.table(table_properties={'quality': 'bronze', 'delta.columnMapping.mode': 'name', 'delta.minReaderVersion': '2', 'delta.minWriterVersion': '5'})
def items_inventory_price():
  return (
     spark.readStream.format('cloudFiles')
     .option('cloudFiles.format', 'json')
     .option('delta.columnMapping.mode', 'name')
     .load(f's3://bucket/inventory/Item/{folder_date}')
    )&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I was looking at `DESCRIBE HISTORY items_inventory_price` to use table versions, but DESCRIBE HISTORY is not supported on Streaming Tables; the error message suggests switching to a SQL warehouse.&lt;/P&gt;&lt;P&gt;If I could add a date column to each data load I would be able to separate the loads, or is there metadata that I can use?&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jan 2025 05:59:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/105840#M42284</guid>
      <dc:creator>jb1z</dc:creator>
      <dc:date>2025-01-16T05:59:36Z</dc:date>
    </item>
    <item>
      <title>Re: Query separate data loads from python spark.readStream</title>
      <link>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/105890#M42301</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/139986"&gt;@jb1z&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;You can use the &lt;CODE&gt;withColumn&lt;/CODE&gt; method to add a date column to your DataFrame. This column will store the date the data was loaded. Update the &lt;CODE&gt;items_inventory_price&lt;/CODE&gt; function to include the date column.&lt;/P&gt;
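&lt;P&gt;For example, a minimal sketch based on the pipeline from the original post (the extra table properties are omitted here for brevity):&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dlt
from datetime import datetime
from pyspark.sql import functions as F

# Date string identifying today's load, e.g. '2025-01-16'
folder_date = datetime.today().strftime('%Y-%m-%d')

@dlt.table(table_properties={'quality': 'bronze'})
def items_inventory_price():
  return (
     spark.readStream.format('cloudFiles')
     .option('cloudFiles.format', 'json')
     .load(f's3://bucket/inventory/Item/{folder_date}')
     # F.lit() stamps every row read in this run with the load date,
     # so each daily load can be queried with WHERE ingestion_date = ...
     .withColumn('ingestion_date', F.lit(folder_date))
    )&lt;/LI-CODE&gt;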
</description>
      <pubDate>Thu, 16 Jan 2025 12:33:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/105890#M42301</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-01-16T12:33:10Z</dc:date>
    </item>
    <item>
      <title>Re: Query separate data loads from python spark.readStream</title>
      <link>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/105994#M42335</link>
      <description>&lt;P&gt;Thank you &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt; for your response. The error message also mentioned a shared cluster. I was able to get access to `describe history` by changing the Access Mode from Single User to Shared in the Compute configuration.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jan 2025 21:02:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/105994#M42335</guid>
      <dc:creator>jb1z</dc:creator>
      <dc:date>2025-01-16T21:02:05Z</dc:date>
    </item>
    <item>
      <title>Re: Query separate data loads from python spark.readStream</title>
      <link>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/106002#M42338</link>
      <description>&lt;P&gt;To be clear, I want to add a new date column so I can query the daily loads of inventory and product. I don't want to modify an existing column.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jan 2025 23:42:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/106002#M42338</guid>
      <dc:creator>jb1z</dc:creator>
      <dc:date>2025-01-16T23:42:57Z</dc:date>
    </item>
    <item>
      <title>Re: Query separate data loads from python spark.readStream</title>
      <link>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/106012#M42346</link>
      <description>&lt;P&gt;The community forum is making my reply post disappear after I post; I have made 5 attempts.&lt;/P&gt;&lt;P&gt;I tried using .withColumn('ingestion_date', functions.col(folder_date)) after .load(), but I am getting the error AnalysisException ... a column or function param cannot be resolved.&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jan 2025 03:25:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/106012#M42346</guid>
      <dc:creator>jb1z</dc:creator>
      <dc:date>2025-01-17T03:25:57Z</dc:date>
    </item>
    <item>
      <title>Re: Query separate data loads from python spark.readStream</title>
      <link>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/106036#M42359</link>
      <description>&lt;P&gt;The problem was fixed by this import:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import functions as F&lt;/LI-CODE&gt;&lt;P&gt;then using F.lit() instead of F.col():&lt;/P&gt;&lt;LI-CODE lang="python"&gt;.withColumn('ingestion_date', F.lit(folder_date))&lt;/LI-CODE&gt;</description>
      <pubDate>Fri, 17 Jan 2025 07:45:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/query-separate-data-loads-from-python-spark-readstream/m-p/106036#M42359</guid>
      <dc:creator>jb1z</dc:creator>
      <dc:date>2025-01-17T07:45:47Z</dc:date>
    </item>
  </channel>
</rss>

