Infinity load execution

dener — Thu, 29 Aug 2024 11:49:38 GMT

I am experiencing performance issues when loading a table with 50 million rows into Delta Lake on AWS using Databricks. Although other, larger tables load without problems, this specific table/process runs for hours and never finishes. Here's the command I am using:

(df.write
   .option('overwriteSchema', 'true')
   .option('mergeSchema', 'true')
   .save(path=sink, format='delta', mode='overwrite'))

Could you please advise on how to resolve this or optimize the process? Thank you. Best regards, Dener Botta Escaliante Moreira

Re: Infinity load execution

VZLA — Fri, 27 Dec 2024 18:06:50 GMT

Thank you for your question! To optimize your Delta Lake write process:

Drop Unneeded Options: Avoid overwriteSchema and mergeSchema unless the schema actually changes between loads. Use:

df.write.format("delta").mode("overwrite").save(sink)
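One way to decide whether overwriteSchema is needed at all is to compare the incoming schema with the one already stored at the sink. The sketch below is only an illustration and not part of the original reply: the helper name needs_schema_overwrite is hypothetical, and it assumes each schema is passed in as a list of (name, type) pairs.

```python
def needs_schema_overwrite(existing_fields, incoming_fields):
    """Return True when the incoming schema differs from the stored one.

    Both arguments are lists of (column_name, type_string) pairs.
    Column order is ignored and names are compared case-insensitively
    (an assumption that mirrors Spark's default case-insensitive resolution).
    """
    def normalize(fields):
        return {(name.lower(), dtype.lower()) for name, dtype in fields}

    return normalize(existing_fields) != normalize(incoming_fields)


# Hypothetical usage in a notebook where `spark`, `df`, and `sink` already exist:
#   def pairs(schema):
#       return [(f.name, f.dataType.simpleString()) for f in schema.fields]
#
#   existing = spark.read.format("delta").load(sink).schema
#   writer = df.write.format("delta").mode("overwrite")
#   if needs_schema_overwrite(pairs(existing), pairs(df.schema)):
#       writer = writer.option("overwriteSchema", "true")
#   writer.save(sink)
```

With this check, the schema option is only set on the loads that need it, and an unexpected schema change becomes visible instead of being silently accepted.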

Increase Parallelism: Use repartition to ensure better resource utilization:

df.repartition(200).write.format("delta").mode("overwrite").save(sink)
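The value 200 above is only a starting point. A rough way to pick the number is to divide the estimated data size by a target size per partition. The sketch below assumes you can estimate the size in bytes yourself (for example from the source files); the function name and the 128 MiB default are illustrative assumptions, not Databricks settings.

```python
def choose_num_partitions(total_bytes,
                          target_bytes_per_partition=128 * 1024**2,
                          min_partitions=1):
    """Estimate a partition count so each partition holds roughly the target size.

    total_bytes is an estimate supplied by the caller; the 128 MiB default
    is an illustrative assumption, not a value taken from the original post.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_bytes_per_partition <= 0:
        raise ValueError("target_bytes_per_partition must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = -(-total_bytes // target_bytes_per_partition)
    return max(min_partitions, needed)


# Hypothetical usage, with `estimated_bytes` provided by you:
#   n = choose_num_partitions(estimated_bytes)
#   df.repartition(n).write.format("delta").mode("overwrite").save(sink)
```

For an estimated 10 GiB of data and the default target, this gives 80 partitions.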

Partition Data: Write data using partitions for better scalability:

df.write.partitionBy("column_name").format("delta").mode("overwrite").save(sink)
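Partitioning only helps when the column has a modest number of distinct values; a high-cardinality column produces a very large number of small files and can make the write slower. Before using partitionBy, it may be worth checking the column along these lines. The helper name and both thresholds are illustrative assumptions, so tune them for your data.

```python
def check_partition_column(distinct_values, total_rows,
                           max_partitions=1000,
                           min_rows_per_partition=1_000_000):
    """Return (ok, reason) for a candidate partition column.

    The thresholds are illustrative assumptions, not official limits.
    """
    if distinct_values <= 0 or total_rows <= 0:
        return False, "column or table appears to be empty"
    if distinct_values > max_partitions:
        return False, f"{distinct_values} distinct values exceeds {max_partitions}"
    rows_per_partition = total_rows // distinct_values
    if rows_per_partition < min_rows_per_partition:
        return False, f"only about {rows_per_partition} rows per partition"
    return True, "ok"


# Hypothetical usage in a notebook where `df` already exists:
#   distinct_values = df.select("column_name").distinct().count()
#   total_rows = df.count()
#   ok, reason = check_partition_column(distinct_values, total_rows)
#   print(ok, reason)
```

With 50 million rows, a column with 12 distinct values passes this check, while a daily date column with 365 values or a near-unique ID column does not.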

Optimize Table Post-write: Run Delta optimizations:

OPTIMIZE delta.`<sink_path>`;
VACUUM delta.`<sink_path>`;
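If you prefer to run these maintenance commands from the same Python notebook as the write, a small helper can build the statements for spark.sql. This is a minimal sketch: the function name build_maintenance_sql is hypothetical, and it assumes the table is addressed by path as in the commands above. Note that VACUUM only removes files older than the retention period (7 days by default), so it will not immediately reclaim the files replaced by the overwrite.

```python
def build_maintenance_sql(path, retain_hours=None):
    """Build OPTIMIZE and VACUUM statements for a path-based Delta table.

    retain_hours is optional; when omitted, VACUUM uses the table's
    default retention period.
    """
    if "`" in path:
        raise ValueError("path must not contain backticks")
    statements = [f"OPTIMIZE delta.`{path}`"]
    vacuum = f"VACUUM delta.`{path}`"
    if retain_hours is not None:
        vacuum += f" RETAIN {int(retain_hours)} HOURS"
    statements.append(vacuum)
    return statements


# Hypothetical usage in a notebook where `spark` and `sink` already exist:
#   for statement in build_maintenance_sql(sink):
#       spark.sql(statement)
```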

Scale Cluster: Use more or larger worker nodes.

Let me know if you need clarification!
