Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Infinity load execution

dener
New Contributor

I am experiencing performance issues when loading a table with 50 million rows into Delta Lake on AWS using Databricks. Other, larger tables load without issue, but this specific table runs for hours and never finishes. Here's the command I am using:

(df.write
    .option('overwriteSchema', 'true')
    .option('mergeSchema', 'true')
    .save(path=sink, format='delta', mode='overwrite'))

Could you please advise on how to resolve this or optimize the process? Thank you. Best regards, Dener Botta Escaliante Moreira

1 REPLY

VZLA
Databricks Employee

Thank you for your question! To optimize your Delta Lake write process:

Disable Overhead Options: Avoid overwriteSchema and mergeSchema unless the schema actually changes between runs; both add overhead on every write, and combining them in a single overwrite is redundant. Use:

df.write.format("delta").mode("overwrite").save(sink)

Increase Parallelism: Use repartition to ensure better resource utilization:

df.repartition(200).write.format("delta").mode("overwrite").save(sink)
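
The right partition count depends on data volume and cluster size, so treat 200 as a starting point rather than a rule. A rough sketch of one common heuristic (an assumption on my part, not a fixed Databricks guideline) is to scale the count with the cores available to the cluster:

num_cores = spark.sparkContext.defaultParallelism  # total task slots in the cluster
# Aim for a couple of partitions per core so all executors stay busy
df.repartition(num_cores * 2).write.format("delta").mode("overwrite").save(sink)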

Partition Data: Write the table partitioned by a column so reads can prune files and writes parallelize better:

df.write.partitionBy("column_name").format("delta").mode("overwrite").save(sink)
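
Pick a low-cardinality partition column (a date usually works well); partitioning by a high-cardinality key creates many tiny files and can slow the write further. A sketch, where event_ts and event_date are hypothetical column names:

from pyspark.sql.functions import to_date

# Derive a daily partition column from a hypothetical timestamp column
(df.withColumn("event_date", to_date("event_ts"))
   .write.partitionBy("event_date")
   .format("delta").mode("overwrite")
   .save(sink))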

Optimize Table Post-write: Run Delta optimizations:

OPTIMIZE delta.`<sink_path>`;
VACUUM delta.`<sink_path>`;
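
If you'd rather stay in Python, the same maintenance commands can be issued through spark.sql (a sketch; <sink_path> is the same placeholder as above):

# OPTIMIZE compacts small files into larger ones
spark.sql("OPTIMIZE delta.`<sink_path>`")
# VACUUM deletes files outside the retention window (7 days by default)
spark.sql("VACUUM delta.`<sink_path>`")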

Scale Cluster: Use more or larger worker nodes.
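
Before scaling, it's worth confirming the job is actually starved for parallelism rather than skewed. A quick diagnostic sketch (assumes df is the DataFrame from your failing write):

from pyspark.sql.functions import spark_partition_id

# How many tasks will the write produce?
print("input partitions:", df.rdd.getNumPartitions())
# Are a few partitions holding most of the rows (skew)?
df.groupBy(spark_partition_id().alias("pid")) \
  .count() \
  .orderBy("count", ascending=False) \
  .show(10)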

Let me know if you need clarification!