Why is Delta Lake creating a 238.0TiB shuffle on merge?
02-24-2023 08:28 AM
I'm frankly at a loss here. I have a task that consistently performs terribly. I took some time this morning to try to debug it, and the physical plan shows a 238 TiB shuffle:
== Physical Plan ==
AdaptiveSparkPlan (40)
+- == Current Plan ==
   SerializeFromObject (22)
   +- MapPartitions (21)
      +- DeserializeToObject (20)
         +- Project (19)
            +- ObjectHashAggregate (18)
               +- Exchange (17)
                  +- ObjectHashAggregate (16)
                     +- ObjectHashAggregate (15)
                        +- ShuffleQueryStage (14), Statistics(sizeInBytes=238.0 TiB)
                           +- Exchange (13)
                              +- ObjectHashAggregate (12)
                                 +- * Project (11)
                                    +- CartesianProduct Inner (10)
                                       :- * Project (5)
                                       :  +- * Filter (4)
                                       :     +- * Project (3)
                                       :        +- * ColumnarToRow (2)
                                       :           +- Scan parquet (1)
                                       +- * Project (9)
                                          +- * Project (8)
                                             +- * ColumnarToRow (7)
                                                +- Scan parquet (6)

I could understand this number if I were working with a lot of data, but I'm not. The Cartesian product in this query produces only 125 rows, so it's not my merge logic.
Additionally, the output table isn't very big either; it's 15 files with no file larger than 10MB (NOTE: I could definitely do some repartitioning here to have a better setup but that's another story).
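To put those figures side by side, here is a rough back-of-the-envelope check in plain Python (no Spark needed). It uses only the numbers quoted in this post; treating "10MB" as 10 MiB and using 15 files x 10 MiB as an upper bound on the table size are my own assumptions, and the variable names are just illustrative.

```python
# Back-of-the-envelope comparison of the reported shuffle size with the
# data actually involved. All inputs are the figures quoted in the post.

TIB = 2**40  # bytes in one tebibyte
MIB = 2**20  # bytes in one mebibyte

SHUFFLE_TIB = 238.0   # Statistics(sizeInBytes=238.0 TiB) on ShuffleQueryStage (14)
JOIN_ROWS = 125       # rows produced by the CartesianProduct
TABLE_FILES = 15      # number of files in the output table
MAX_FILE_MIB = 10     # assumption: "10MB" read as 10 MiB per file, at most

# Reported shuffle size in bytes.
shuffle_bytes = SHUFFLE_TIB * TIB

# How much shuffle data each joined row would have to account for.
tib_per_row = SHUFFLE_TIB / JOIN_ROWS

# Upper bound on the output table size (every file at the 10 MiB maximum).
table_upper_bound_bytes = TABLE_FILES * MAX_FILE_MIB * MIB

# How many times larger the reported shuffle is than that upper bound.
ratio = shuffle_bytes / table_upper_bound_bytes

print(f"Shuffle per joined row: {tib_per_row:.3f} TiB")
print(f"Output table upper bound: {table_upper_bound_bytes / MIB:.0f} MiB")
print(f"Reported shuffle vs. table upper bound: {ratio:,.0f}x")
```

Under those assumptions, each of the 125 joined rows would have to carry close to 2 TiB of shuffle data, and the reported shuffle would be over a million times the size of the whole output table, which is why the number looks wrong to me.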
I feel like I'm at my wits' end with this problem. Any ideas would be appreciated.
Labels:
- Delta Lake
- MERGE Performance