Fastest way to write a Spark Dataframe to a delta table
05-20-2024 08:57 AM
I read a huge array with several columns into memory and convert it into a Spark DataFrame. When I then write it to a Delta table using the following command, it takes forever (I have a driver with large memory and 32 workers): df_exp.write.mode("append").format("delta").saveAsTable(save_table_name). How can I write this to a Delta table as fast as possible?
05-20-2024 12:54 PM
Hello @nakaxa ,
Spark lazily evaluates its plans, and based on your issue description, it appears that the dataframe's origin is not Spark itself. Since Spark commands are lazily evaluated, I suspect that the time-consuming aspect is not the write itself but the preceding operations.
If your data source is in-memory (driver memory) and you're transforming it into a Spark dataframe, all processing before the write operation occurs on the driver node. This node then shuffles the data between the 32 executors before performing the write, thereby benefiting from Spark's parallelism.
If you want to benefit from Spark's parallelism and performance throughout the whole job, avoid non-Spark datasets and this kind of driver-side conversion wherever possible.
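As an illustration of that recommendation (not code from the original thread), a minimal sketch of letting Spark read the source files directly instead of materializing them on the driver might look like the following; the source path and format are placeholders:

```python
# Hedged sketch: let the executors read the source in parallel instead of
# building a huge in-memory array on the driver and converting it.
# "/mnt/raw/exports/" is a hypothetical placeholder path.
df_exp = (
    spark.read
         .format("parquet")          # or "csv"/"json", depending on the source
         .load("/mnt/raw/exports/")  # the read happens on the executors, not the driver
)

df_exp.write.mode("append").format("delta").saveAsTable(save_table_name)
```

Reading on the executors keeps both the transformations and the Delta write distributed, so the driver never becomes the bottleneck.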
Please let me know if my answer is helpful for your case.
Raphael Balogo
Sr. Technical Solutions Engineer
Databricks
05-20-2024 01:00 PM
Hello @nakaxa, how are you?
Although this is the simplest and recommended way to have Spark create your table, you can check the Spark UI to understand where possible bottlenecks are happening. Look at the jobs and stages where most of the time is being spent. After that, check whether too much data is being shuffled over the network. If that's the case, you can increase the size of your workers and enable disk autoscaling on your cluster to process the data faster.
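As a small sketch (not part of the original reply), inspecting the physical plan from a notebook can hint at where shuffles occur before you dig into the Spark UI stages; df_exp is the DataFrame from the question:

```python
# Hedged sketch: look for Exchange (shuffle) nodes in the physical plan.
df_exp.explain(mode="formatted")

# The shuffle-partition setting also influences how data moves between the 32 workers.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```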
Best,
Alessandro
11-21-2024 05:22 AM - edited 11-21-2024 05:30 AM
The answers here are not correct.
TL;DR: _after_ the Spark DataFrame is materialized, saveAsTable takes ages: 35 seconds for 1 million rows.
saveAsTable() is terribly slow. Why? It would be nice to get an answer. My workaround is to avoid Spark for the Delta write entirely (note: I am not using Photon, for my own reasons). I simply write plain Parquet files with pyarrow.parquet and then load them into a Delta table with a SQL warehouse (which does use Photon).
I have a tiny Arrow data frame with 19 columns and 1 million rows. The whole computation takes 2 seconds in plain Python, and the steps before the write take about 1 second.
What am I missing?
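For reference, a minimal sketch of the Parquet-then-load workaround described above; the staging path, catalog/table names, and the COPY INTO statement are illustrative placeholders, not the poster's actual code:

```python
# Hedged sketch: write plain Parquet from the in-memory Arrow table, bypassing Spark.
import pyarrow.parquet as pq

# `arrow_table` stands in for the 19-column, 1M-row Arrow table from the post.
pq.write_table(arrow_table, "/Volumes/main/default/staging/export.parquet")

# Then, on a SQL warehouse (where Photon is used), load the staged files into Delta:
#
#   COPY INTO main.default.save_table
#   FROM '/Volumes/main/default/staging/'
#   FILEFORMAT = PARQUET;
```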
11-21-2024 05:42 AM
I have a slight suspicion that createDataFrame uses the columnar Arrow representation for .display(), but when the data is finally written, Spark's row-based representation kicks in and the data is expensively reserialized.
I cannot find the right place in the documentation, so I have no reference, but it seems that when creating a DataFrame in Spark the data is row-based: Spark uses its internal Row or InternalRow objects to represent each record.
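A related, hedged sketch (not a confirmed explanation of the slowdown): PySpark's Arrow optimization is documented for pandas-to-Spark conversions, and checking whether it is enabled is cheap; the sample data and table name below are placeholders:

```python
# Hedged sketch: enable Arrow-accelerated conversion for pandas <-> Spark DataFrames.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

import pandas as pd

pdf = pd.DataFrame({"id": range(1_000_000)})        # stand-in for the real 19-column data
sdf = spark.createDataFrame(pdf)                    # Arrow-accelerated when the flag is on
sdf.write.mode("append").format("delta").saveAsTable("my_table")  # hypothetical table name
```

Even with Arrow enabled for the conversion, the Delta write itself still goes through Spark's internal row representation, which would be consistent with the suspicion above.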

