<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Specifying Output mode and Path when using For Each Batch in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/specifing-output-mode-and-path-when-using-for-each-batch/m-p/121522#M46472</link>
    <description>&lt;P&gt;Thanks xD&lt;/P&gt;</description>
    <pubDate>Wed, 11 Jun 2025 19:01:19 GMT</pubDate>
    <dc:creator>Branislav</dc:creator>
    <dc:date>2025-06-11T19:01:19Z</dc:date>
    <item>
      <title>Specifying Output mode and Path when using For Each Batch</title>
      <link>https://community.databricks.com/t5/data-engineering/specifing-output-mode-and-path-when-using-for-each-batch/m-p/121302#M46413</link>
      <description>&lt;P&gt;Since .foreachBatch() "hijacks" the stream and runs arbitrary code on each micro-batch, do I need to specify the output mode and path, like this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;(df.writeStream
.format("delta")
.trigger(availableNow = True)
.option("checkpointLocation", "check_point_location")

.foreachBatch(data_load)

.outputMode('update')
.option('path', output_filepath)
.start()
)&lt;/LI-CODE&gt;&lt;P&gt;Or can I leave them out:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;(df.writeStream
  .format("delta")
  .trigger(availableNow = True)
  .option("checkpointLocation", "check_point_location")

  .foreachBatch(data_load)

  .start()
  
)&lt;/LI-CODE&gt;&lt;P&gt;Code for data_load:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def data_load(df, batchId):
  # Upsert the micro-batch into target (presumably a DeltaTable defined elsewhere).
  (target.alias("target").merge(
    source = df.alias("source"),
    condition = "target.key = source.key"
  ).whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute()
  )&lt;/LI-CODE&gt;</description>
      <pubDate>Tue, 10 Jun 2025 08:39:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/specifing-output-mode-and-path-when-using-for-each-batch/m-p/121302#M46413</guid>
      <dc:creator>IGRACH</dc:creator>
      <dc:date>2025-06-10T08:39:22Z</dc:date>
    </item>
    <item>
      <title>Re: Specifying Output mode and Path when using For Each Batch</title>
      <link>https://community.databricks.com/t5/data-engineering/specifing-output-mode-and-path-when-using-for-each-batch/m-p/121509#M46469</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/152643"&gt;@IGRACH&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Good day!!&lt;/P&gt;
&lt;P data-start="0" data-end="142"&gt;When you're using &lt;CODE data-start="34" data-end="51"&gt;.foreachBatch()&lt;/CODE&gt;,&amp;nbsp;&amp;nbsp;it&amp;nbsp;&lt;STRONG data-start="189" data-end="213"&gt;"hijacks" the stream&lt;/STRONG&gt; and gives you the DataFrame for each micro-batch. Inside that function, &lt;STRONG data-start="286" data-end="321"&gt;you define exactly what happens&lt;/STRONG&gt; — whether you merge, update, insert, or write somewhere else.&lt;/P&gt;
&lt;P data-start="0" data-end="142"&gt;Because of this:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-start="545" data-end="647"&gt;
&lt;P data-start="547" data-end="647"&gt;Spark doesn’t need to know &lt;STRONG data-start="574" data-end="600"&gt;how to output the data&lt;/STRONG&gt; — it delegates that entirely to &lt;STRONG data-start="633" data-end="646"&gt;your code&lt;/STRONG&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI data-start="648" data-end="751"&gt;
&lt;P data-start="650" data-end="751"&gt;So, &lt;CODE data-start="654" data-end="676"&gt;outputMode("append")&lt;/CODE&gt; / &lt;CODE data-start="679" data-end="701"&gt;outputMode("update")&lt;/CODE&gt; / &lt;CODE data-start="704" data-end="728"&gt;outputMode("complete")&lt;/CODE&gt; become &lt;STRONG data-start="736" data-end="750"&gt;irrelevant&lt;/STRONG&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
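&lt;P&gt;The delegation described above can be modeled in plain Python, without Spark. This is only a sketch: run_stream, make_upsert, and target_table are hypothetical names standing in for the streaming engine, the foreachBatch function, and the target table. It shows that the engine just hands each micro-batch to the function, and the function decides how rows are written (here, an upsert on key).&lt;/P&gt;

```python
# Plain-Python model (no Spark): the "engine" only hands batches to a callback.
def run_stream(micro_batches, batch_fn):
    """Hypothetical stand-in for the streaming engine: call batch_fn once per micro-batch."""
    for batch_id, rows in enumerate(micro_batches):
        batch_fn(rows, batch_id)


def make_upsert(target):
    """Return a callback that upserts rows into target (a dict keyed by row["key"])."""
    def upsert(rows, batch_id):
        for row in rows:
            # Matched key: overwrite the existing row. Unmatched key: insert it.
            target[row["key"]] = dict(row)
    return upsert


# Hypothetical data: one existing row, then two micro-batches.
target_table = {1: {"key": 1, "value": "old"}}
batches = [
    [{"key": 1, "value": "new"}, {"key": 2, "value": "a"}],
    [{"key": 2, "value": "b"}],
]
run_stream(batches, make_upsert(target_table))
print(target_table)
```

&lt;P&gt;Nothing in run_stream knows or cares where the rows end up, which is the sense in which the write is delegated to your code.&lt;/P&gt;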
&lt;P data-start="0" data-end="142"&gt;So you need not specify&amp;nbsp;&lt;STRONG&gt;outputMode(...)&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="0" data-end="142"&gt;&lt;STRONG&gt;You can use the below approach.&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="0" data-end="142"&gt;df.writeStream \&lt;BR /&gt;.trigger(availableNow=True) \&lt;BR /&gt;.option("checkpointLocation", "check_point_location") \&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;.foreachBatch(data_load) \&lt;BR /&gt;.start()&lt;/P&gt;
&lt;P data-start="0" data-end="142"&gt;&lt;STRONG&gt;Kindly let me know if you have any questions on this.&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;H3 data-start="144" data-end="181"&gt;&amp;nbsp;&lt;/H3&gt;</description>
      <pubDate>Wed, 11 Jun 2025 16:49:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/specifing-output-mode-and-path-when-using-for-each-batch/m-p/121509#M46469</guid>
      <dc:creator>Saritha_S</dc:creator>
      <dc:date>2025-06-11T16:49:38Z</dc:date>
    </item>
    <item>
      <title>Re: Specifying Output mode and Path when using For Each Batch</title>
      <link>https://community.databricks.com/t5/data-engineering/specifing-output-mode-and-path-when-using-for-each-batch/m-p/121522#M46472</link>
      <description>&lt;P&gt;Thanks xD&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jun 2025 19:01:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/specifing-output-mode-and-path-when-using-for-each-batch/m-p/121522#M46472</guid>
      <dc:creator>Branislav</dc:creator>
      <dc:date>2025-06-11T19:01:19Z</dc:date>
    </item>
  </channel>
</rss>

