Re: Optimizing a complex pyspark join

Alberto_Umana · ‎12-29-2024

To optimize your complex join in PySpark, you can try the following additional strategies:

Skew Join Optimization Using Skew Hints:

You can use skew hints to inform Spark about the skewed keys. This can help Spark to optimize the join by handling the skewed partitions more efficiently. For example:

SELECT /*+ SKEW('df1', 'main_key') */ df1.id, df2.id

FROM df1

JOIN df2 ON df1.main_key = df2.main_key

AND (df1.col1_is_null OR (df1.col1 = df2.col1))

AND (df1.col2_is_null OR (df1.col2 = df2.col2))

...

Adaptive Query Execution (AQE):

Ensure that AQE is enabled, which you have already done. AQE can dynamically optimize the query plan at runtime, including handling skewed joins.
You can further fine-tune AQE settings, such as adjusting the threshold for skew join detection:

spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")

spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")

Cost-Based Optimizer (CBO):

Enable and utilize the Cost-Based Optimizer to improve join strategies. Ensure that table statistics are up-to-date:

ANALYZE TABLE df1 COMPUTE STATISTICS FOR ALL COLUMNS;

ANALYZE TABLE df2 COMPUTE STATISTICS FOR ALL COLUMNS;

Dynamic Partition Pruning (DPP):

Ensure that Dynamic Partition Pruning is enabled, which can help in optimizing joins by pruning unnecessary partitions:

spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")