How to optimize exporting dataframe to delta file?

Frankooo
New Contributor III

Scenario: I have a dataframe with 5 billion rows and 100+ columns. Is there a way to write this out in Delta format efficiently? I tried to export it but cancelled the job after 2 hours (the write didn't finish), as this processing time is not acceptable from a business perspective. What I have done so far is export using the line below. I am using Python, by the way.

dataframe_name.repartition(50).write.partitionBy("<colA>", "<colB>").format("delta").mode("overwrite").option("overwriteSchema", "true").save("<StoragePathLocation>")

Cluster Config:

Standard_DS4_V2 - 28GB Mem, 8 Cores


Hubert-Dudek
Esteemed Contributor III

Since Delta is a transactional format, I am not sure it is needed here; it may be better to just use plain Parquet files.

You can read the dataframe's source data in chunks and write with append mode.

If the data is updated often, I think you should consider Structured Streaming.
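For illustration, a minimal sketch of the chunked-append idea in PySpark (the source_df dataframe, the event_date split column, and the target path are hypothetical placeholders):

from pyspark.sql import functions as F

# Hypothetical sketch: write the data one date-sized chunk at a time with append
# mode instead of rewriting all 5 billion rows in a single overwrite.
dates = [r["event_date"] for r in source_df.select("event_date").distinct().collect()]

for d in dates:
    (source_df
        .filter(F.col("event_date") == d)
        .write
        .format("delta")
        .mode("append")
        .save("<StoragePathLocation>"))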

Hello @Hubert Dudek, thank you for your inputs. I haven't heard of Structured Streaming, as I only started working with Databricks recently. I am also using ADF to orchestrate the notebook; will this impact the process?

Hubert-Dudek
Esteemed Contributor III

Yes, with streaming you can still run it from Azure Data Factory. However, if you want to run the stream indefinitely, you cannot use a schedule and you need to handle possible failures yourself. If you want the stream to finish (time out) after some period, you can use a schedule (for example, a timeout after 23.9 hours, run every day). Every case is different.
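As a rough sketch of the timed-out variant (source, checkpoint, and target paths are placeholders; the 23.9-hour budget matches the example above):

# Hypothetical sketch: stream into the Delta table and stop after a fixed time
# budget so a scheduled daily ADF run can finish cleanly.
stream = (spark.readStream
    .format("delta")
    .load("<SourcePath>")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "<CheckpointPath>")
    .outputMode("append")
    .start("<StoragePathLocation>"))

stream.awaitTermination(timeout=int(23.9 * 60 * 60))  # timeout in seconds
stream.stop()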

-werners-
Esteemed Contributor III

Is this a one-time operation, or do you want to write these 5 billion records regularly?

If it is a one-time operation, you can use the optimized writes of Delta Lake:

set spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true;

set spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true;

That way the partition size is handled automatically and you can omit the repartition(50).

The repartition itself causes a lot of shuffling.

Also check whether you need the overwriteSchema option. I don't know for sure whether it has a performance impact, but any check that does not have to be executed is time gained.
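For illustration, with those defaults in place the write could be reduced to something like this (the session-level config names are assumptions based on the Databricks auto-optimize settings; columns and path remain placeholders):

# Assumed session-level equivalents of the table-property defaults above.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# No repartition(50), and no overwriteSchema unless the schema actually changes.
(dataframe_name
    .write
    .partitionBy("<colA>", "<colB>")
    .format("delta")
    .mode("overwrite")
    .save("<StoragePathLocation>"))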

New data can be added in append mode or merge mode (if updates are necessary).

If it is a regular recurring job and you have to overwrite the data over and over (or drop and recreate it), I do not see a lot of added value in using Delta Lake, apart from the automated partition sizing and read optimizations. But without knowing the downstream data flow, this is hard to assess.
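To illustrate the append/merge suggestion above, a MERGE for the update case might look roughly like this (the key column and the updates_df dataframe are hypothetical):

from delta.tables import DeltaTable

# Hypothetical sketch: upsert a batch of changes instead of overwriting everything.
target = DeltaTable.forPath(spark, "<StoragePathLocation>")

(target.alias("t")
    .merge(updates_df.alias("s"), "t.<keyCol> = s.<keyCol>")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())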

Frankooo
New Contributor III

Hi @Werner Stinckens, thank you for your inputs.

This is a daily operation, unfortunately. Optimized writes and autoCompact are already enabled. I will try omitting repartition(50). I also tried salting and broadcast joins; they help somewhat but do not make a huge difference.

-werners-
Esteemed Contributor III

Ok, so there are also joins in the picture.

I think you have to check the Spark UI and try to find the issue.

It might be a join that is causing this, or data skew.

The repartition will cause a shuffle for sure.

Anonymous
Not applicable

If you have 5 billion records, partitioning by 2 keys may be too much. Each partition should be about 10-50 GB. Also, the repartition(50) may create 50 files in each partition directory.
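As a quick sanity check of the partition-key choice (same placeholder column names as above), you can count how many partition directories the two keys would produce:

# Hypothetical check: a high distinct count here means many small partitions.
dataframe_name.select("<colA>", "<colB>").distinct().count()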

jose_gonzalez
Moderator

Hi @Franco Sia,

I would recommend avoiding repartition(50); instead, enable optimized writes on your Delta table. You can find more details here.

Enable optimized writes and auto compaction on your Delta table. Use AQE (docs here) to make sure you have enough shuffle partitions, and also check your joins in case you have any data skew.
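For illustration, the AQE settings this refers to might look like the following (standard Spark 3 configs; verify the exact names against your runtime's documentation):

# Let AQE pick shuffle partition counts and handle skewed joins at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")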

Kaniz
Community Manager

Hi @Franco Sia, just a friendly follow-up. Do you still need help, or did the above responses help you find a solution? Please let us know.
