Infinity load execution
08-29-2024 04:49 AM
I am experiencing performance issues when loading a table with 50 million rows into Delta Lake on AWS using Databricks. Although other, larger tables load successfully, this specific table/process runs for hours without finishing. Here's the command I am using:
(df.write
   .option('overwriteSchema', 'true')
   .option('mergeSchema', 'true')
   .save(path=sink, format='delta', mode='overwrite'))
Could you please advise on how to resolve this or optimize the process? Thank you. Best regards, Dener Botta Escaliante Moreira
- Labels: Delta Lake, Spark
12-27-2024 10:06 AM
Thank you for your question! To optimize your Delta Lake write process:
Disable Overhead Options: overwriteSchema and mergeSchema add extra schema handling on every write, so omit them unless the schema is actually changing. Use:
df.write.format("delta").mode("overwrite").save(sink)
Increase Parallelism: Use repartition so the write is spread evenly across the cluster's cores:
df.repartition(200).write.format("delta").mode("overwrite").save(sink)
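A fixed value of 200 may not suit every cluster; one common heuristic (a sketch only, assuming the spark session is in scope) is to derive the target count from the cluster's parallelism:

# Heuristic sketch: aim for roughly 2-3 tasks per available core, then tune
target_partitions = spark.sparkContext.defaultParallelism * 3
df.repartition(target_partitions).write.format("delta").mode("overwrite").save(sink)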
Partition Data: Write with partitionBy on a column that downstream queries commonly filter on, so reads and future writes scale better:
df.write.partitionBy("column_name").format("delta").mode("overwrite").save(sink)
Optimize Table Post-write: Run Delta optimizations:
OPTIMIZE delta.`<sink_path>`;
VACUUM delta.`<sink_path>`;
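If you prefer to stay in Python instead of SQL, the Delta Lake Python API exposes equivalent calls; this is a sketch assuming sink is the table path and the delta package is available on the cluster (it is bundled with Databricks runtimes):

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, sink)
dt.optimize().executeCompaction()  # compacts small files, like OPTIMIZE
dt.vacuum()                        # removes unreferenced files past the retention period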
Scale Cluster: Use more or larger worker nodes.
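Putting the suggestions together, a minimal end-to-end sketch could look like the following ("column_name" is a placeholder for whatever partition column fits your data):

# Sketch: drop the schema options, repartition, partition the table, then compact
(df.repartition(200)
   .write
   .format("delta")
   .mode("overwrite")
   .partitionBy("column_name")
   .save(sink))

spark.sql(f"OPTIMIZE delta.`{sink}`")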
Let me know if you need clarification!

