How to optimize exporting dataframe to delta file?

Frankooo
New Contributor III

Scenario: I have a dataframe with 5 billion rows and 100+ columns. Is there a way to write this out in Delta format efficiently? I tried to export it but cancelled the job after 2 hours (the write didn't finish), as this processing time is not acceptable from a business perspective. What I have done so far is export using the line below. I am using Python, by the way.

dataframe_name.repartition(50).write.partitionBy("<colA>", "<colB>").format("delta").mode("overwrite").option("overwriteSchema", "true").save("<StoragePathLocation>")

Cluster Config:

Standard_DS4_V2 - 28GB Mem, 8 Cores


Hubert-Dudek
Esteemed Contributor III

Since Delta is a transactional format, I am not sure it is needed here; it may be better to just use plain Parquet files.

You can read the dataframe's source data in chunks and write with append mode.

If the data is updated often, I think you should consider Structured Streaming.
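For illustration, a minimal sketch of the chunked-append idea in PySpark (the source_df dataframe, the event_date split column, and the target path are hypothetical placeholders):

from pyspark.sql import functions as F

# Hypothetical sketch: write the data one date-sized chunk at a time with append
# mode instead of rewriting all 5 billion rows in a single overwrite.
dates = [r["event_date"] for r in source_df.select("event_date").distinct().collect()]

for d in dates:
    (source_df
        .filter(F.col("event_date") == d)
        .write
        .format("delta")
        .mode("append")
        .save("<StoragePathLocation>"))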

Hello @Hubert Dudek, thank you for your inputs. I haven't heard of Structured Streaming, as I only started working with Databricks recently. I am also using ADF to orchestrate the notebook; will this impact the process?

Hubert-Dudek
Esteemed Contributor III

Yes, with streaming you can still run it from Azure Data Factory. However, if you want to run the stream indefinitely, you cannot use a schedule and you need to handle possible failures yourself. If you want the stream to finish (time out) after some period, you can use a schedule (for example, a timeout after 23.9 hours, run every day). Every case is different.
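As a rough sketch of the timed-out variant (source, checkpoint, and target paths are placeholders; the 23.9-hour budget matches the example above):

# Hypothetical sketch: stream into the Delta table and stop after a fixed time
# budget so a scheduled daily ADF run can finish cleanly.
stream = (spark.readStream
    .format("delta")
    .load("<SourcePath>")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "<CheckpointPath>")
    .outputMode("append")
    .start("<StoragePathLocation>"))

stream.awaitTermination(timeout=int(23.9 * 60 * 60))  # timeout in seconds
stream.stop()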

-werners-
Esteemed Contributor III

Is this a one-time operation, or do you want to write these 5 billion records regularly?

If it is a one-time operation, you can use the optimized writes of Delta Lake:

set spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true;

set spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true;

That way the partition size is handled automatically and you can omit the repartition(50).

The repartition itself causes a lot of shuffling.

Also check whether you need the overwriteSchema option. I don't know for sure whether it has a performance impact, but any check that does not have to be executed is time gained.
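For illustration, with those defaults in place the write could be reduced to something like this (the session-level config names are assumptions based on the Databricks auto-optimize settings; columns and path remain placeholders):

# Assumed session-level equivalents of the table-property defaults above.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# No repartition(50), and no overwriteSchema unless the schema actually changes.
(dataframe_name
    .write
    .partitionBy("<colA>", "<colB>")
    .format("delta")
    .mode("overwrite")
    .save("<StoragePathLocation>"))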

New data can be added in append mode or merge mode (if updates are necessary).

If it is a regular recurring job and you have to overwrite the data over and over (or drop and recreate it), I do not see a lot of added value in using Delta Lake, apart from the automated partition sizing and read optimizations. But without knowing the downstream data flow, this is hard to assess.
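To illustrate the append/merge suggestion above, a MERGE for the update case might look roughly like this (the key column and the updates_df dataframe are hypothetical):

from delta.tables import DeltaTable

# Hypothetical sketch: upsert a batch of changes instead of overwriting everything.
target = DeltaTable.forPath(spark, "<StoragePathLocation>")

(target.alias("t")
    .merge(updates_df.alias("s"), "t.<keyCol> = s.<keyCol>")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())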

Frankooo
New Contributor III

Hi @Werner Stinckens, thank you for your inputs.

This is a daily operation, unfortunately. Optimized writes and autoCompact are already enabled. I will try omitting repartition(50). I also tried salting and broadcast joins; they help somewhat but do not make a huge difference.

-werners-
Esteemed Contributor III

Ok, so there are also joins in the picture.

I think you have to check the Spark UI and try to find the issue.

It might be a join that is causing this, or data skew.

The repartition will cause a shuffle for sure.

Anonymous
Not applicable

If you have 5 billion records, partitioning by 2 keys may be too much. Each partition should be about 10-50 GB. Also, the repartition(50) may create 50 files in each partition directory.
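As a quick sanity check of the partition-key choice (same placeholder column names as above), you can count how many partition directories the two keys would produce:

# Hypothetical check: a high distinct count here means many small partitions.
dataframe_name.select("<colA>", "<colB>").distinct().count()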

jose_gonzalez
Moderator

Hi @Franco Sia,

I would recommend avoiding repartition(50); instead, enable optimized writes on your Delta table. You can find more details here.

Enable optimized writes and auto compaction on your Delta table. Use AQE (docs here) to make sure you have enough shuffle partitions, and also check your joins in case you have any data skew.
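For illustration, the AQE settings this refers to might look like the following (standard Spark 3 configs; verify the exact names against your runtime's documentation):

# Let AQE pick shuffle partition counts and handle skewed joins at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")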

Kaniz
Community Manager

Hi @Franco Sia, just a friendly follow-up. Do you still need help, or did the above responses help you find a solution? Please let us know.
