cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Performance Issue : Create DELTA table form 2 TB PARQUET file

KuldeepChitraka
New Contributor III

We are trying to create a DELTA table (CTAS statement) from 2 TB PARQUET file and its taking huge amount of time around 12~ hrs.

is it normal.? What are option to tune/optimize this ? are we doing anything wrong

Cluster : Interactive/30 Cores / 320 GB Memory / 4 workers

3 REPLIES 3

shan_chandra
Honored Contributor III
Honored Contributor III

@Kuldeep Chitrakar​ - Please try to evaluate(explain plan) the physical plan on the CTAS query before creating the table. Below are a few things that can be validated before turning the cluster size.

  1. validate the join conditions used in CTAS query.
  2. will a plain select query work?
  3. Tuning spark.sql.shuffle.partitions to see if more number of tasks are spun in parallel to reduce the time taken.
  4. Is there a skew in the join?
  5. will AQE config help? (https://docs.databricks.com/optimizations/aqe.html)

Cami
Contributor III

I do not have experience with 2TB dataset but I recommend you check it out:

  1. spark.sql.shuffle.partitions ( doc examples: Link 1, Link 2 )
  2. Tune file size

Can you share with us a screen from SPARK UI for CTAS statement ( SPARK UI ->STAGES -> select CTAS -> Summary metrics and Aggregated metrics )?

Can you check the size of the parquet files created under the delta table.?

Hubert-Dudek
Esteemed Contributor III

Please use COPY INTO (first create an empty delta table) or CONVERT TO DELTA instead of CTAS it will be much more faster, and it process will be auto-optimized.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.