Re: Optimizing recursive joins on group and UNION-...

-werners- · ‎09-05-2024

The recursive join is definitely a performance killer: it will blow up the query plan.
So I would advise against using it.
Alternatives? Well, a fixed number of joins, for example, if that is an option of course (i.e. if you know the maximum depth up front).
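
To make the fixed-depth idea concrete, here is a rough plain-Python sketch, not Spark code. It assumes a hypothetical child -> parent mapping and a known maximum depth; `resolve_root`, `parent_of` and `max_depth` are names made up for this example. Each pass of the loop stands in for one self-join on the edge table.

```python
def resolve_root(parent_of, max_depth):
    """Follow child -> parent links at most max_depth times.

    parent_of maps each node to its parent, or to None for a root
    (hypothetical input shape, assumed for this sketch).
    """
    # Start with every node pointing at itself.
    current = {node: node for node in parent_of}
    for _ in range(max_depth):
        # One "join": move each node's current ancestor one level up,
        # keeping it unchanged once a root has been reached.
        step = {}
        for node, ancestor in current.items():
            parent = parent_of.get(ancestor)
            step[node] = ancestor if parent is None else parent
        current = step
    return current


parent_of = {"a": None, "b": "a", "c": "b", "d": "c"}
roots = resolve_root(parent_of, max_depth=3)
```

The catch is the same as with a fixed number of joins in SQL: if the real depth is larger than `max_depth`, some nodes stop short of their true root.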
Using a graph algorithm is also an option.
It is important that you figure out what kind of graph you have, or whether you even have multiple graphs: is it directed, is everything connected, is it acyclic or not, do you want to visit all edges and vertices, etc.
Once you have that, you can choose between:
- using GraphFrames/GraphX in Spark (not easy to use!)
- using pure Python with some graph processing package (only an option if the amount of data is reasonable)
- using some kind of graph software outside of Databricks
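
For the pure-Python route, here is a minimal sketch. It assumes the recursive join is really computing groups of linked records, i.e. connected components of an undirected graph; that is an assumption about the use case. It uses a hand-rolled union-find from the standard library only (a graph package would give you the same thing ready-made), and `connected_components` and the sample `edges` are hypothetical.

```python
def connected_components(edges):
    """Group nodes that are linked directly or indirectly.

    edges is an iterable of (a, b) pairs, treated as undirected.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Sort for a stable, readable result.
    return sorted(groups.values(), key=lambda group: sorted(group))


edges = [("a", "b"), ("b", "c"), ("d", "e")]
components = connected_components(edges)
```

This only works while the edge list fits comfortably in memory on one machine, which is the "reasonable amount of data" caveat above.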

IIRC there was some talk of introducing Cypher (of Neo4j) into Spark or Databricks, but that apparently never happened.
