<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Fastest way to write a Spark Dataframe to a delta table in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/70016#M33969</link>
    <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/105525"&gt;@nakaxa&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;Spark lazily evaluates its plans, and based on your issue description, it appears that the dataframe's origin is not Spark itself. Since Spark commands are lazily evaluated, I suspect that the time-consuming aspect is not the write itself but the preceding operations.&lt;/P&gt;
&lt;P&gt;If your data source is in-memory (driver memory) and you're converting it into a Spark dataframe, all processing before the write operation occurs on the driver node. The driver then has to serialize the data and distribute it across the 32 executors before the write, so only the write itself benefits from Spark's parallelism.&lt;/P&gt;
&lt;P&gt;If you want to benefit from Spark parallelism and performance throughout your whole job, avoid non-Spark datasets and these kinds of conversions.&lt;/P&gt;
&lt;P&gt;Please let me know if my answer is helpful for your case.&lt;/P&gt;</description>
    <pubDate>Mon, 20 May 2024 19:54:48 GMT</pubDate>
    <dc:creator>raphaelblg</dc:creator>
    <dc:date>2024-05-20T19:54:48Z</dc:date>
    <item>
      <title>Fastest way to write a Spark Dataframe to a delta table</title>
      <link>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/70003#M33960</link>
      <description>&lt;P&gt;I read a huge array with several columns into memory, then convert it into a Spark dataframe. When I write it to a Delta table using the following command, it takes forever (I have a driver with large memory and 32 workers): df_exp.write.mode("append").format("delta").saveAsTable(save_table_name) How can I write this to a Delta table as fast as possible?&lt;/P&gt;</description>
      <pubDate>Mon, 20 May 2024 15:57:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/70003#M33960</guid>
      <dc:creator>nakaxa</dc:creator>
      <dc:date>2024-05-20T15:57:10Z</dc:date>
    </item>
    <item>
      <title>Re: Fastest way to write a Spark Dataframe to a delta table</title>
      <link>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/70016#M33969</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/105525"&gt;@nakaxa&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;Spark lazily evaluates its plans, and based on your issue description, it appears that the dataframe's origin is not Spark itself. Since Spark commands are lazily evaluated, I suspect that the time-consuming aspect is not the write itself but the preceding operations.&lt;/P&gt;
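&lt;P&gt;As a rough analogy in plain Python (no Spark involved), the sketch below shows why deferred work shows up at the step that finally consumes it. The names expensive_transform and write_rows are hypothetical stand-ins, not Spark APIs.&lt;/P&gt;

```python
# Plain-Python analogy for lazy evaluation; nothing here calls Spark.
# expensive_transform and write_rows are hypothetical stand-ins.

calls = {"transform": 0}


def expensive_transform(row):
    # Count how often the "preceding operation" actually runs.
    calls["transform"] += 1
    return row * 2


def write_rows(rows):
    # Consuming the iterator triggers the deferred work, much like
    # an action such as a write triggers execution of a lazy plan.
    return list(rows)


source = range(5)
plan = map(expensive_transform, source)  # lazy: nothing has run yet
calls_before_write = calls["transform"]

written = write_rows(plan)  # all deferred work happens here
calls_after_write = calls["transform"]

print(calls_before_write, calls_after_write, written)
```

&lt;P&gt;The transform runs zero times when the plan is defined and five times once the write consumes it, which is why timing only the final write can be misleading.&lt;/P&gt;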
&lt;P&gt;If your data source is in-memory (driver memory) and you're converting it into a Spark dataframe, all processing before the write operation occurs on the driver node. The driver then has to serialize the data and distribute it across the 32 executors before the write, so only the write itself benefits from Spark's parallelism.&lt;/P&gt;
&lt;P&gt;If you want to benefit from Spark parallelism and performance throughout your whole job, avoid non-Spark datasets and these kinds of conversions.&lt;/P&gt;
&lt;P&gt;Please let me know if my answer is helpful for your case.&lt;/P&gt;</description>
      <pubDate>Mon, 20 May 2024 19:54:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/70016#M33969</guid>
      <dc:creator>raphaelblg</dc:creator>
      <dc:date>2024-05-20T19:54:48Z</dc:date>
    </item>
    <item>
      <title>Re: Fastest way to write a Spark Dataframe to a delta table</title>
      <link>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/70017#M33970</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/105525"&gt;@nakaxa&lt;/a&gt;, how are you?&lt;/P&gt;
&lt;P&gt;Although this is the simplest and best way to tell Spark to create your table, you can check the Spark UI to understand where possible bottlenecks are. Look for the jobs and stages where most of the time is being spent. After that, you can see whether too much data is being shuffled over the network. If that's the case, you can increase the size of your workers and enable disk autoscaling on your cluster to process the data faster.&lt;/P&gt;
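&lt;P&gt;Before digging into the Spark UI, it can help to time each step separately from the notebook. This is a minimal sketch using only the Python standard library; the timed helper and the step names are assumptions for illustration, and the list operations are stand-ins for the real Spark calls.&lt;/P&gt;

```python
import time
from contextlib import contextmanager

# Wall-clock duration per named step, filled in by timed().
timings = {}


@contextmanager
def timed(step):
    # Hypothetical helper: record how long the wrapped block takes.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start


with timed("materialize"):
    # Stand-in for an action that forces the dataframe to be computed.
    rows = [i * i for i in range(1000)]

with timed("write"):
    # Stand-in for the actual write to the table.
    total = sum(rows)

# The step with the largest duration is the first place to look.
slowest = max(timings, key=timings.get)
print(timings, slowest)
```

&lt;P&gt;In a real notebook the two stand-ins would be replaced with an action that forces the dataframe (for example a count) and with the saveAsTable call from the question.&lt;/P&gt;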
&lt;P&gt;Best,&lt;/P&gt;
&lt;P&gt;Alessandro&lt;/P&gt;</description>
      <pubDate>Mon, 20 May 2024 20:00:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/70017#M33970</guid>
      <dc:creator>anardinelli</dc:creator>
      <dc:date>2024-05-20T20:00:16Z</dc:date>
    </item>
    <item>
      <title>Re: Fastest way to write a Spark Dataframe to a delta table</title>
      <link>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/99610#M40041</link>
      <description>&lt;P&gt;The answers here are not correct.&lt;/P&gt;&lt;P&gt;TLDR: _After_ the Spark DF is materialized, saveAsTable takes ages: 35 seconds for 1 million rows.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;saveAsTable() is SLOW - terribly so. Why? Would be nice to get an answer. The workaround is to avoid Spark for Delta - note that I am not using Photon, for reasons. So I just write plain Parquet files with pyarrow.parquet and then read them with a SQL warehouse into a Delta table (using Photon).&lt;BR /&gt;&lt;BR /&gt;I have a tiny Arrow data frame with 19 columns and 1 million rows. The whole computation takes 2 seconds in stupid Python, and&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;spark_df &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; spark.&lt;/SPAN&gt;&lt;SPAN&gt;createDataFrame&lt;/SPAN&gt;&lt;SPAN&gt;(data.&lt;/SPAN&gt;&lt;SPAN&gt;to_pandas&lt;/SPAN&gt;&lt;SPAN&gt;())&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;spark_df.&lt;/SPAN&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&lt;BR /&gt;take 1 second.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&lt;SPAN&gt;Then comes&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;spark_df.write.&lt;/SPAN&gt;&lt;SPAN&gt;format&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"delta"&lt;/SPAN&gt;&lt;SPAN&gt;).&lt;/SPAN&gt;&lt;SPAN&gt;mode&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"append"&lt;/SPAN&gt;&lt;SPAN&gt;).&lt;/SPAN&gt;&lt;SPAN&gt;saveAsTable&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"default.hello_sleepy"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;with a whopping 35 seconds?! What is that? Running this single-threaded with delta-io writes instantly. Also, pyarrow.parquet.write_table takes a second. But saveAsTable takes 35? 
What is going on here?&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Once it is figured out how to run the calculation as fast single-threaded on Databricks Spark as on a Raspberry Pi, I would like to run this on worker executors for 15000 files in parallel. Actually, this whole exercise might be better done in Lambda, but it should still be possible.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;What am I missing?&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Thu, 21 Nov 2024 13:30:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/99610#M40041</guid>
      <dc:creator>Reiska</dc:creator>
      <dc:date>2024-11-21T13:30:32Z</dc:date>
    </item>
    <item>
      <title>Re: Fastest way to write a Spark Dataframe to a delta table</title>
      <link>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/99618#M40047</link>
      <description>&lt;P&gt;I have a slight suspicion that createDataFrame uses the columnar Arrow format for .display(), but when it finally comes to writing, Spark's row-based representation kicks in and the data is expensively reserialized.&lt;/P&gt;&lt;P&gt;I cannot find the right place in the documentation, so I have no reference, but it seems that:&lt;BR /&gt;When creating a DataFrame in Spark, the data is row-based.&amp;nbsp;Spark uses its internal Row or InternalRow objects to represent each record.&lt;/P&gt;</description>
      <pubDate>Thu, 21 Nov 2024 13:42:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/99618#M40047</guid>
      <dc:creator>Reiska</dc:creator>
      <dc:date>2024-11-21T13:42:14Z</dc:date>
    </item>
    <item>
      <title>Re: Fastest way to write a Spark Dataframe to a delta table</title>
      <link>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/155811#M54317</link>
      <description>&lt;P&gt;Out of interest, did you try seeing what happens if you break the steps down into something like...&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;df.write().format(&lt;/SPAN&gt;&lt;SPAN class=""&gt;"parquet"&lt;/SPAN&gt;&lt;SPAN class=""&gt;).mode(SaveMode.Overwrite).save(parquetPath);&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;Followed by (a plain Parquet directory has to be converted before it can be registered as a Delta table)...&lt;BR /&gt;spark.sql("CONVERT TO DELTA parquet.`" + parquetPath + "`");&lt;BR /&gt;spark.sql("CREATE TABLE my_delta_table USING DELTA LOCATION '" + parquetPath + "'");&lt;BR /&gt;&lt;BR /&gt;To confirm whether the slow point was in the actual writing out of the dataframe?&lt;/P&gt;</description>
      <pubDate>Wed, 29 Apr 2026 15:04:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fastest-way-to-write-a-spark-dataframe-to-a-delta-table/m-p/155811#M54317</guid>
      <dc:creator>ShawnRR</dc:creator>
      <dc:date>2026-04-29T15:04:32Z</dc:date>
    </item>
  </channel>
</rss>

