Fastest way to write a Spark Dataframe to a delta table
05-20-2024 08:57 AM
I read a huge array with several columns into memory and convert it into a Spark DataFrame. When I then write it to a Delta table using the following command, it takes forever (I have a driver with large memory and 32 workers): df_exp.write.mode("append").format("delta").saveAsTable(save_table_name). How can I write this to a Delta table as fast as possible?
05-20-2024 12:54 PM
Hello @nakaxa ,
Spark lazily evaluates its plans, and based on your issue description, it appears that the dataframe's origin is not Spark itself. Since Spark commands are lazily evaluated, I suspect that the time-consuming aspect is not the write itself but the preceding operations.
If your data source is in-memory (driver memory) and you're transforming it into a Spark dataframe, all processing before the write operation occurs on the driver node. This node then shuffles the data between the 32 executors before performing the write, thereby benefiting from Spark's parallelism.
If you want to benefit from Spark's parallelism and performance throughout the whole job, avoid non-Spark datasets and this kind of driver-side conversion wherever possible.
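As an illustration of that recommendation (not code from the original thread), a minimal sketch of letting Spark read the source files directly instead of materializing them on the driver might look like the following; the source path and format are placeholders:

```python
# Hedged sketch: let the executors read the source in parallel instead of
# building a huge in-memory array on the driver and converting it.
# "/mnt/raw/exports/" is a hypothetical placeholder path.
df_exp = (
    spark.read
         .format("parquet")          # or "csv"/"json", depending on the source
         .load("/mnt/raw/exports/")  # the read happens on the executors, not the driver
)

df_exp.write.mode("append").format("delta").saveAsTable(save_table_name)
```

Reading on the executors keeps both the transformations and the Delta write distributed, so the driver never becomes the bottleneck.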
Please let me know if my answer is helpful for your case.
Raphael Balogo
Sr. Technical Solutions Engineer
Databricks
05-20-2024 01:00 PM
Hello @nakaxa, how are you?
Although this is the simplest and recommended way to have Spark create your table, you can check the Spark UI to understand where possible bottlenecks are happening. Look at the jobs and stages where most of the time is being spent. After that, check whether too much data is being shuffled over the network. If that's the case, you can increase the size of your workers and enable disk autoscaling on your cluster to process the data faster.
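As a small sketch (not part of the original reply), inspecting the physical plan from a notebook can hint at where shuffles occur before you dig into the Spark UI stages; df_exp is the DataFrame from the question:

```python
# Hedged sketch: look for Exchange (shuffle) nodes in the physical plan.
df_exp.explain(mode="formatted")

# The shuffle-partition setting also influences how data moves between the 32 workers.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```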
Best,
Alessandro
11-21-2024 05:22 AM - edited 11-21-2024 05:30 AM
The answers here are not correct.
TL;DR: _after_ the Spark DataFrame is materialized, saveAsTable takes ages: 35 seconds for 1 million rows.
saveAsTable() is terribly slow. Why? It would be nice to get an answer. My workaround is to avoid Spark for the Delta write entirely (note: I am not using Photon, for my own reasons). I simply write plain Parquet files with pyarrow.parquet and then load them into a Delta table with a SQL warehouse (which does use Photon).
I have a tiny Arrow data frame with 19 columns and 1 million rows. The whole computation takes 2 seconds in plain Python, and the steps before the write take about 1 second.
What am I missing?
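For reference, a minimal sketch of the Parquet-then-load workaround described above; the staging path, catalog/table names, and the COPY INTO statement are illustrative placeholders, not the poster's actual code:

```python
# Hedged sketch: write plain Parquet from the in-memory Arrow table, bypassing Spark.
import pyarrow.parquet as pq

# `arrow_table` stands in for the 19-column, 1M-row Arrow table from the post.
pq.write_table(arrow_table, "/Volumes/main/default/staging/export.parquet")

# Then, on a SQL warehouse (where Photon is used), load the staged files into Delta:
#
#   COPY INTO main.default.save_table
#   FROM '/Volumes/main/default/staging/'
#   FILEFORMAT = PARQUET;
```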
11-21-2024 05:42 AM
I have a slight suspicion that createDataFrame uses the columnar Arrow representation for .display(), but when the data is finally written, Spark's row-based representation kicks in and the data is expensively reserialized.
I cannot find the right place in the documentation, so I have no reference, but it seems that when creating a DataFrame in Spark the data is row-based: Spark uses its internal Row or InternalRow objects to represent each record.
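A related, hedged sketch (not a confirmed explanation of the slowdown): PySpark's Arrow optimization is documented for pandas-to-Spark conversions, and checking whether it is enabled is cheap; the sample data and table name below are placeholders:

```python
# Hedged sketch: enable Arrow-accelerated conversion for pandas <-> Spark DataFrames.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

import pandas as pd

pdf = pd.DataFrame({"id": range(1_000_000)})        # stand-in for the real 19-column data
sdf = spark.createDataFrame(pdf)                    # Arrow-accelerated when the flag is on
sdf.write.mode("append").format("delta").saveAsTable("my_table")  # hypothetical table name
```

Even with Arrow enabled for the conversion, the Delta write itself still goes through Spark's internal row representation, which would be consistent with the suspicion above.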

