Performance Issue: Create DELTA table from 2 TB PARQUET file
01-31-2023 08:08 AM
We are trying to create a DELTA table (CTAS statement) from a 2 TB PARQUET file, and it is taking a huge amount of time, around 12 hours.
Is this normal? What are the options to tune/optimize this? Are we doing anything wrong?
Cluster: Interactive / 30 cores / 320 GB memory / 4 workers
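For reference, a minimal sketch of the pattern being described, assuming a straight SELECT from a Parquet location (the table name and path are hypothetical):

```sql
-- Hypothetical CTAS: create a Delta table from an existing Parquet location
CREATE TABLE sales_delta
USING DELTA
AS SELECT * FROM parquet.`/mnt/raw/sales_parquet/`;
```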
Labels: Ctas, Delta, Performance Issue
01-31-2023 10:01 AM
@Kuldeep Chitrakar - Please evaluate the physical plan (EXPLAIN) of the CTAS query before creating the table. Below are a few things to validate before tuning the cluster size (a sketch of these checks follows the list):
- Validate the join conditions used in the CTAS query.
- Will a plain SELECT query work?
- Tune spark.sql.shuffle.partitions to see if more tasks run in parallel and reduce the time taken.
- Is there skew in the join?
- Would the AQE config help? (https://docs.databricks.com/optimizations/aqe.html)
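A minimal sketch of these checks in SQL (the source path is hypothetical, and the shuffle-partition value is only illustrative):

```sql
-- Inspect the physical plan of the SELECT feeding the CTAS before running it:
EXPLAIN FORMATTED
SELECT * FROM parquet.`/mnt/raw/sales_parquet/`;

-- Raise shuffle parallelism if stages run with too few tasks (default is 200):
SET spark.sql.shuffle.partitions = 2000;

-- Enable Adaptive Query Execution so Spark can coalesce partitions
-- and mitigate skewed joins at runtime:
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
```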
01-31-2023 10:18 AM
I do not have experience with 2 TB datasets, but I recommend you check out:
- spark.sql.shuffle.partitions (doc examples: Link 1, Link 2)
- Tuning the file size
Can you share a screenshot from the Spark UI for the CTAS statement (Spark UI -> Stages -> select the CTAS stage -> Summary Metrics and Aggregated Metrics)?
Can you check the size of the Parquet files created under the Delta table? (One way to do this is sketched below.)
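A minimal sketch of the file-size check (the table name is hypothetical):

```sql
-- DESCRIBE DETAIL reports numFiles and sizeInBytes for a Delta table,
-- which gives the file count and the average file size:
DESCRIBE DETAIL sales_delta;

-- If the table ended up with many small files, OPTIMIZE compacts them:
OPTIMIZE sales_delta;
```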
01-31-2023 10:58 AM
Please use COPY INTO (first create an empty Delta table) or CONVERT TO DELTA instead of CTAS; it will be much faster, and the process will be auto-optimized. Both options are sketched below.
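A minimal sketch of both suggestions, assuming hypothetical table/path names and placeholder columns:

```sql
-- Option 1: create an empty Delta table, then bulk-load the Parquet data.
CREATE TABLE sales_delta (
  id BIGINT,
  amount DOUBLE,
  ts TIMESTAMP  -- placeholder columns; match your actual Parquet schema
) USING DELTA;

COPY INTO sales_delta
FROM '/mnt/raw/sales_parquet/'
FILEFORMAT = PARQUET;

-- Option 2: convert the Parquet directory to Delta in place. This only
-- writes the Delta transaction log and does not rewrite the 2 TB of data.
CONVERT TO DELTA parquet.`/mnt/raw/sales_parquet/`;
```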

