<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to make the write operation faster for writing a spark dataframe to a delta table in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-make-the-write-operation-faster-for-writing-a-spark/m-p/111475#M43905</link>
    <description>&lt;P&gt;We recommend using a spatial framework such as &lt;A href="https://github.com/databrickslabs/mosaic" target="_self"&gt;Databricks Mosaic&lt;/A&gt;&amp;nbsp;or &lt;A href="https://sedona.apache.org/1.6.0/setup/databricks/" target="_self"&gt;Apache Sedona&lt;/A&gt; to speed up operations like spatial joins and point-in-polygon tests. Without these frameworks, many of these operations result in unoptimized, explosive cross joins.&lt;/P&gt;</description>
    <pubDate>Fri, 28 Feb 2025 16:34:58 GMT</pubDate>
    <dc:creator>cgrant</dc:creator>
    <dc:date>2025-02-28T16:34:58Z</dc:date>
    <item>
      <title>How to make the write operation faster for writing a spark dataframe to a delta table</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-make-the-write-operation-faster-for-writing-a-spark/m-p/111344#M43853</link>
      <description>&lt;P&gt;I am running 4 spatial join operations on files with the following sizes:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Base_road_file, which is 1 GB&lt;/LI&gt;&lt;LI&gt;Telematics file, which is 1.2 GB&lt;/LI&gt;&lt;LI&gt;State boundary file, BH road file, client_geofence file, and kpmg_geofence_file, none of which are very large&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;My Databricks cluster details are as follows:&lt;/P&gt;&lt;P&gt;13.3 LTS runtime, Standard_DS5_v2 (56 GB memory, 16 cores) for both driver and worker nodes&lt;/P&gt;&lt;P&gt;The issue is that the joins happen within seconds, but writing to a Delta table times out my entire run. Moreover, even if I increase the timeout, the whole operation keeps running for hours, which is not acceptable for my client.&lt;/P&gt;&lt;P&gt;I have tried repartitioning and have also added optimizeWrite to my Spark session settings, but nothing seems to help. Could anyone please suggest a way to make my write operation faster?&lt;/P&gt;</description>
      <pubDate>Thu, 27 Feb 2025 01:22:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-make-the-write-operation-faster-for-writing-a-spark/m-p/111344#M43853</guid>
      <dc:creator>Sjoshi</dc:creator>
      <dc:date>2025-02-27T01:22:58Z</dc:date>
    </item>
    <item>
      <title>Re: How to make the write operation faster for writing a spark dataframe to a delta table</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-make-the-write-operation-faster-for-writing-a-spark/m-p/111473#M43903</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/151058"&gt;@Sjoshi&lt;/a&gt;,&lt;BR /&gt;&lt;BR /&gt;I think the following information would help in understanding the problem you're experiencing:&lt;BR /&gt;- The schema of the tables involved in the joins.&lt;BR /&gt;- The join condition used in each join.&lt;BR /&gt;- How the inputs are stored before the job reads them for the joins.&lt;BR /&gt;- Any Spark configuration options you've set beyond the defaults.&lt;BR /&gt;- The query plans, either from the Spark UI or from running explain.&lt;BR /&gt;&lt;BR /&gt;Additionally, I've found this past Spark Summit talk very helpful for inspecting and improving the performance of my own workloads:&amp;nbsp;&lt;A href="https://www.youtube.com/watch?v=daXEp4HmS-E" target="_blank"&gt;https://www.youtube.com/watch?v=daXEp4HmS-E&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 28 Feb 2025 16:12:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-make-the-write-operation-faster-for-writing-a-spark/m-p/111473#M43903</guid>
      <dc:creator>Nik_Vanderhoof</dc:creator>
      <dc:date>2025-02-28T16:12:43Z</dc:date>
    </item>
    <item>
      <title>Re: How to make the write operation faster for writing a spark dataframe to a delta table</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-make-the-write-operation-faster-for-writing-a-spark/m-p/111475#M43905</link>
      <description>&lt;P&gt;We recommend using a spatial framework such as &lt;A href="https://github.com/databrickslabs/mosaic" target="_self"&gt;Databricks Mosaic&lt;/A&gt;&amp;nbsp;or &lt;A href="https://sedona.apache.org/1.6.0/setup/databricks/" target="_self"&gt;Apache Sedona&lt;/A&gt; to speed up operations like spatial joins and point-in-polygon tests. Without these frameworks, many of these operations result in unoptimized, explosive cross joins.&lt;/P&gt;</description>
      <pubDate>Fri, 28 Feb 2025 16:34:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-make-the-write-operation-faster-for-writing-a-spark/m-p/111475#M43905</guid>
      <dc:creator>cgrant</dc:creator>
      <dc:date>2025-02-28T16:34:58Z</dc:date>
    </item>
  </channel>
</rss>

