Performance Issue: Create DELTA table from 2 TB PARQUET file
01-31-2023 08:08 AM
We are trying to create a DELTA table (CTAS statement) from a 2 TB PARQUET file, and it is taking a huge amount of time, around 12 hours.
Is this normal? What are the options to tune/optimize this? Are we doing anything wrong?
Cluster: Interactive / 30 cores / 320 GB memory / 4 workers
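For reference, a minimal sketch of the pattern being described, assuming a straight SELECT from a Parquet location (the table name and path are hypothetical):

```sql
-- Hypothetical CTAS: create a Delta table from an existing Parquet location
CREATE TABLE sales_delta
USING DELTA
AS SELECT * FROM parquet.`/mnt/raw/sales_parquet/`;
```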
Labels: Ctas, Delta, Performance Issue
01-31-2023 10:01 AM
@Kuldeep Chitrakar - Please evaluate the physical plan (EXPLAIN) of the CTAS query before creating the table. Below are a few things to validate before tuning the cluster size (a sketch of these checks follows the list):
- Validate the join conditions used in the CTAS query.
- Will a plain SELECT query work?
- Tune spark.sql.shuffle.partitions to see if more tasks run in parallel and reduce the time taken.
- Is there skew in the join?
- Would the AQE config help? (https://docs.databricks.com/optimizations/aqe.html)
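A minimal sketch of these checks in SQL (the source path is hypothetical, and the shuffle-partition value is only illustrative):

```sql
-- Inspect the physical plan of the SELECT feeding the CTAS before running it:
EXPLAIN FORMATTED
SELECT * FROM parquet.`/mnt/raw/sales_parquet/`;

-- Raise shuffle parallelism if stages run with too few tasks (default is 200):
SET spark.sql.shuffle.partitions = 2000;

-- Enable Adaptive Query Execution so Spark can coalesce partitions
-- and mitigate skewed joins at runtime:
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
```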
01-31-2023 10:18 AM
I do not have experience with 2 TB datasets, but I recommend you check out:
- spark.sql.shuffle.partitions (doc examples: Link 1, Link 2)
- Tuning the file size
Can you share a screenshot from the Spark UI for the CTAS statement (Spark UI -> Stages -> select the CTAS stage -> Summary Metrics and Aggregated Metrics)?
Can you check the size of the Parquet files created under the Delta table? (One way to do this is sketched below.)
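A minimal sketch of the file-size check (the table name is hypothetical):

```sql
-- DESCRIBE DETAIL reports numFiles and sizeInBytes for a Delta table,
-- which gives the file count and the average file size:
DESCRIBE DETAIL sales_delta;

-- If the table ended up with many small files, OPTIMIZE compacts them:
OPTIMIZE sales_delta;
```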
01-31-2023 10:58 AM
Please use COPY INTO (first create an empty Delta table) or CONVERT TO DELTA instead of CTAS; it will be much faster, and the process will be auto-optimized. Both options are sketched below.
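A minimal sketch of both suggestions, assuming hypothetical table/path names and placeholder columns:

```sql
-- Option 1: create an empty Delta table, then bulk-load the Parquet data.
CREATE TABLE sales_delta (
  id BIGINT,
  amount DOUBLE,
  ts TIMESTAMP  -- placeholder columns; match your actual Parquet schema
) USING DELTA;

COPY INTO sales_delta
FROM '/mnt/raw/sales_parquet/'
FILEFORMAT = PARQUET;

-- Option 2: convert the Parquet directory to Delta in place. This only
-- writes the Delta transaction log and does not rewrite the 2 TB of data.
CONVERT TO DELTA parquet.`/mnt/raw/sales_parquet/`;
```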

