Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Fastest way to write a Spark DataFrame to a Delta table

nakaxa
New Contributor

I read a huge array with several columns into memory and then convert it into a Spark DataFrame. When I write it to a Delta table with the following command, it takes forever (I have a driver with large memory and 32 workers):

df_exp.write.mode("append").format("delta").saveAsTable(save_table_name)

How can I write this to a Delta table as fast as possible?

2 REPLIES

raphaelblg
Honored Contributor

 

Hello @nakaxa ,

Spark evaluates its plans lazily, and based on your description the DataFrame does not originate from Spark itself. Because of that lazy evaluation, I suspect the time-consuming part is not the write itself but the operations that precede it.

If your data source sits in driver memory and you transform it into a Spark DataFrame, all of the processing before the write happens on the driver node. The driver then has to distribute the data across the 32 executors before the write is performed, so only the write itself benefits from Spark's parallelism.

If you want to benefit from Spark's parallelism and performance throughout the whole job, avoid non-Spark datasets and this kind of conversion; instead, have Spark read the source data directly so the executors do the work from the start, as in the sketch below.
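For example, here is a rough sketch of the difference (the table name, source path, and format are just assumptions for illustration, not from your setup):

import numpy as np

# Driver-bound approach: the array is built on the driver, then converted row by row.
arr = np.random.rand(1_000_000, 3)
df_driver = spark.createDataFrame(arr.tolist(), ["a", "b", "c"])  # every row passes through the driver
df_driver.write.mode("append").format("delta").saveAsTable("my_table")  # hypothetical table name

# Parallel approach: let the executors read the source directly.
df_parallel = spark.read.parquet("/path/to/source")  # hypothetical path
df_parallel.write.mode("append").format("delta").saveAsTable("my_table")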

Please let me know if my answer is helpful for your case.

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

anardinelli
New Contributor III

Hello @nakaxa, how are you?

Although this is the simplest and best way to have Spark create your table, you can check the Spark UI to understand where possible bottlenecks are occurring. Look for the jobs and stages where most of the time is being spent. From there, you can see whether too much data is being shuffled across the network. If that is the case, you can increase the size of your workers and enable disk autoscaling on your cluster to process the data faster.
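As a rough sketch (the partition count below is just an assumption; tune it to your cluster), you can also check how many tasks the write will use and repartition before writing:

# Inspect how many partitions (and therefore write tasks) the DataFrame has.
print(df_exp.rdd.getNumPartitions())

# Repartition so the write is spread across the 32 workers; 128 is an assumed value.
df_exp.repartition(128).write.mode("append").format("delta").saveAsTable(save_table_name)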

Best,

Alessandro
