Infinity load execution
08-29-2024 04:49 AM
I am experiencing performance issues when loading a table with 50 million rows into Delta Lake on AWS using Databricks. Although other, larger tables load successfully, this specific table/process runs for hours without finishing. Here's the command I am using:
(df.write
   .option('overwriteSchema', 'true')
   .option('mergeSchema', 'true')
   .save(path=sink, format='delta', mode='overwrite'))
Could you please advise on how to resolve this or optimize the process? Thank you. Best regards, Dener Botta Escaliante Moreira
- Labels: Delta Lake, Spark
12-27-2024 10:06 AM
Thank you for your question! To optimize your Delta Lake write process:
Disable Overhead Options: overwriteSchema and mergeSchema add extra schema handling on every write, so omit them unless the schema is actually changing. Use:
df.write.format("delta").mode("overwrite").save(sink)
Increase Parallelism: Use repartition so the write is spread evenly across the cluster's cores:
df.repartition(200).write.format("delta").mode("overwrite").save(sink)
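A fixed value of 200 may not suit every cluster; one common heuristic (a sketch only, assuming the spark session is in scope) is to derive the target count from the cluster's parallelism:

# Heuristic sketch: aim for roughly 2-3 tasks per available core, then tune
target_partitions = spark.sparkContext.defaultParallelism * 3
df.repartition(target_partitions).write.format("delta").mode("overwrite").save(sink)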
Partition Data: Write with partitionBy on a column that downstream queries commonly filter on, so reads and future writes scale better:
df.write.partitionBy("column_name").format("delta").mode("overwrite").save(sink)
Optimize Table Post-write: Run Delta optimizations:
OPTIMIZE delta.`<sink_path>`;
VACUUM delta.`<sink_path>`;
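If you prefer to stay in Python instead of SQL, the Delta Lake Python API exposes equivalent calls; this is a sketch assuming sink is the table path and the delta package is available on the cluster (it is bundled with Databricks runtimes):

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, sink)
dt.optimize().executeCompaction()  # compacts small files, like OPTIMIZE
dt.vacuum()                        # removes unreferenced files past the retention period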
Scale Cluster: Use more or larger worker nodes.
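Putting the suggestions together, a minimal end-to-end sketch could look like the following ("column_name" is a placeholder for whatever partition column fits your data):

# Sketch: drop the schema options, repartition, partition the table, then compact
(df.repartition(200)
   .write
   .format("delta")
   .mode("overwrite")
   .partitionBy("column_name")
   .save(sink))

spark.sql(f"OPTIMIZE delta.`{sink}`")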
Let me know if you need clarification!

