Hi @maarko
This is a fascinating issue that points to several potential causes related to differences between
Photon and standard Spark execution engines, particularly around decimal handling and parallelism.
Root Causes
1. Decimal Precision and Scale Handling
Comparing columns with different decimal types, DECIMAL(21,5) vs DECIMAL(38,6), forces an implicit conversion to a common type, and that conversion can behave differently between Photon and standard Spark. Even after explicit casting, the underlying comparison logic may differ between the two engines.
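As a sketch (table names A and B and the join key ID are placeholders standing in for your schema), casting both sides to the wider type up front removes the implicit conversion from the equation:

```sql
-- Cast both sides to the wider type so neither engine has to
-- pick an implicit common type for the comparison.
SELECT A.COL1, B.COL2
FROM A
LEFT OUTER JOIN B
  ON A.ID = B.ID
WHERE CAST(A.COL1 AS DECIMAL(38,6)) != CAST(B.COL2 AS DECIMAL(38,6));
```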
2. Parallel Execution Non-Determinism
Non-Photon clusters use standard Spark's parallel execution, which can introduce non-deterministic behavior when:
- Hash-based operations encounter hash collisions
- Floating-point arithmetic operations are reordered
- Memory pressure causes different execution paths
3. NULL Handling in LEFT OUTER JOIN
With a LEFT OUTER JOIN, B.COL2 is NULL for every unmatched row. In SQL, A.COL1 != B.COL2 evaluates to NULL (not TRUE) when B.COL2 is NULL, so those rows are silently filtered out, and edge cases in this NULL handling might differ between engines.
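A quick illustration of the trap (same placeholder schema as above):

```sql
-- Unmatched rows from the LEFT OUTER JOIN have B.COL2 = NULL.
-- A.COL1 != NULL evaluates to NULL (unknown), and WHERE only keeps
-- rows whose predicate is TRUE, so these rows disappear.
SELECT A.COL1, B.COL2
FROM A
LEFT OUTER JOIN B
  ON A.ID = B.ID
WHERE A.COL1 != B.COL2;  -- silently excludes rows where B.COL2 IS NULL
```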
Recommended Solutions
Solution 1: Explicit NULL Handling
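For example (a sketch against the same placeholder schema), either test for NULL explicitly or use Spark SQL's null-safe equality operator <=>:

```sql
-- Option A: keep unmatched rows explicitly.
SELECT A.COL1, B.COL2
FROM A
LEFT OUTER JOIN B
  ON A.ID = B.ID
WHERE B.COL2 IS NULL
   OR A.COL1 != B.COL2;

-- Option B: null-safe inequality. <=> treats NULL as a comparable
-- value, so NOT (x <=> y) is TRUE whenever the two sides differ,
-- including NULL vs non-NULL.
SELECT A.COL1, B.COL2
FROM A
LEFT OUTER JOIN B
  ON A.ID = B.ID
WHERE NOT (A.COL1 <=> B.COL2);
```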
Solution 2: Use COALESCE for Consistent Comparison
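One way to do this is to replace NULLs with a sentinel before comparing; combined with explicit casts, this gives a single deterministic comparison path. The sentinel value -999999 here is an assumption — pick one that can never occur in your data:

```sql
-- COALESCE swaps NULL for a sentinel so both engines always compare
-- two real DECIMAL(38,6) values. The sentinel must not appear in data.
SELECT A.COL1, B.COL2
FROM A
LEFT OUTER JOIN B
  ON A.ID = B.ID
WHERE CAST(A.COL1 AS DECIMAL(38,6))
   != COALESCE(CAST(B.COL2 AS DECIMAL(38,6)), CAST(-999999 AS DECIMAL(38,6)));
```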
Solution 3: Deterministic Ordering
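If part of the discrepancy is row ordering between runs, an explicit ORDER BY on a unique key makes the output reproducible (sketch, assuming ID is unique in A):

```sql
-- Without ORDER BY, Spark guarantees no ordering, and Photon and
-- non-Photon plans may emit rows in different orders.
SELECT A.COL1, B.COL2
FROM A
LEFT OUTER JOIN B
  ON A.ID = B.ID
WHERE A.COL1 != B.COL2
ORDER BY A.ID;
```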
Solution 4: Repartition for Consistency
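As a sketch, Spark SQL's REPARTITION hint forces a fixed shuffle layout before the join; the partition count of 200 is an assumption to tune for your data volume:

```sql
-- Pin the shuffle partitioning so both engines see the same data
-- layout going into the join. You can also pass partitioning
-- columns to the hint, e.g. REPARTITION(200, ID).
SELECT /*+ REPARTITION(200) */ A.COL1, B.COL2
FROM A
LEFT OUTER JOIN B
  ON A.ID = B.ID
WHERE A.COL1 != B.COL2;
```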
LR