Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Fastest way to write a Spark Dataframe to a delta table

nakaxa
New Contributor

I read a huge array with several columns into memory, then convert it into a Spark dataframe. When I write it to a Delta table using the following command, it takes forever (I have a driver with large memory and 32 workers):

df_exp.write.mode("append").format("delta").saveAsTable(save_table_name)

How can I write this to a Delta table as fast as possible?

4 REPLIES

raphaelblg
Databricks Employee

 

Hello @nakaxa ,

Spark evaluates its plans lazily, and based on your description the dataframe does not originate from Spark itself. Because of that lazy evaluation, I suspect the time-consuming part is not the write itself but the operations that precede it.

If your data source is in driver memory and you are converting it into a Spark dataframe, all processing before the write happens on the driver node. The driver then has to distribute the data across the 32 executors before the write itself can benefit from Spark's parallelism.

If you want to benefit from Spark's parallelism and performance throughout the whole job, avoid non-Spark datasets and these kinds of conversions, and let Spark read the data source directly where possible.
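For illustration only, a minimal sketch of what that could look like if the source data lives in files; the path and format below are assumptions, not taken from the original post:

# Hypothetical sketch: if the "huge array" originates from files, let Spark read
# the source directly so the load itself is distributed across the 32 workers,
# instead of building the data in driver memory and converting it afterwards.
df_exp = (
    spark.read
    .format("parquet")            # or "csv", "json", ... depending on the real source
    .load("/path/to/source/")     # assumed location, adjust to your environment
)

df_exp.write.mode("append").format("delta").saveAsTable(save_table_name)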

Please let me know if my answer is helpful for your case.

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

anardinelli
Databricks Employee

Hello @nakaxa, how are you?

Although this is the simplest and most direct way to have Spark create your table, you can check the Spark UI to understand where possible bottlenecks are happening. Look for the jobs and stages where most of the time is being spent, and then check whether too much data is being shuffled over the network. If that is the case, you can increase the size of your workers and enable disk autoscaling on your cluster to process the data faster.
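Alongside the Spark UI, one cheap thing to check is how many partitions the dataframe has before the write, since a dataframe built from driver-side data can end up with very few partitions and therefore little write parallelism. A rough sketch using standard PySpark APIs; the partition count is arbitrary and the variable names are taken from the original question:

# How many partitions will the write run with?
print(df_exp.rdd.getNumPartitions())

# If the count is very low, spreading the data out first may help (64 is illustrative).
df_exp.repartition(64).write.mode("append").format("delta").saveAsTable(save_table_name)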

Best,

Alessandro

Reiska
New Contributor II

The answers here are not correct.

TL;DR: _After_ the Spark DF is materialized, saveAsTable takes ages: 35 seconds for 1 million rows.


saveAsTable() is SLOW, terribly so. Why? It would be nice to get an answer. My workaround is to avoid Spark for the Delta write entirely (note that I am not using Photon, for my own reasons): I just write plain Parquet files with pyarrow.parquet and then load them into a Delta table with a SQL warehouse (which does use Photon). A sketch of this workaround is at the end of this post.

I have a tiny Arrow table with 19 columns and 1 million rows. The whole computation takes 2 seconds in plain Python, and

spark_df = spark.createDataFrame(data.to_pandas())
spark_df.display()

take 1 second.
 
Then comes

spark_df.write.format("delta").mode("append").saveAsTable("default.hello_sleepy")
 
with a whopping 35 seconds?! What is that? Running this single-threaded with delta-io writes instantly, and pyarrow.parquet.write_table takes about a second. But saveAsTable takes 35? What is going on here?
 
Once I figure out how to run the calculation single-threaded on Databricks Spark as fast as on a Raspberry Pi, I would like to run this on worker executors for 15000 files in parallel. Actually, this whole exercise might be better done in Lambda, but it should still be possible here.

What am I missing?
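For reference, a minimal sketch of the workaround mentioned above; the staging path and the follow-up load step are assumptions about the setup, not tested code from this thread:

import pyarrow.parquet as pq

# Stage the Arrow table as plain Parquet, bypassing Spark entirely for the write.
pq.write_table(data, "/tmp/staging/hello_sleepy.parquet")   # path is illustrative

# The staged Parquet can then be loaded into a Delta table from a SQL warehouse,
# e.g. with a COPY INTO ... FILEFORMAT = PARQUET statement run there.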

Reiska
New Contributor II

I have a slight suspicion that createDataFrame is using the columnar Arrow representation for .display(), but that when the write finally happens, Spark's row-based representation kicks in and the data is expensively reserialized:

I cannot find the right place in the documentation, so I have no reference, but it seems that:
When creating a DataFrame in Spark, the data is row-based. Spark uses its internal Row or InternalRow objects to represent each record.
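If that suspicion is right, one thing that might be worth trying is Arrow-optimized conversion for createDataFrame from pandas; a sketch using a standard Spark config (whether it actually helps the saveAsTable time in this case is untested):

# Lets Spark use Arrow when converting the pandas dataframe into a Spark dataframe;
# this speeds up the conversion path, not necessarily the Delta write itself.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

spark_df = spark.createDataFrame(data.to_pandas())
spark_df.write.format("delta").mode("append").saveAsTable("default.hello_sleepy")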
