09-05-2024 06:07 AM
The recursive join is definitely a performance killer: it will blow up the query plan.
So I would advise against using it.
Alternatives? Well, a fixed number of joins, for example, if that is an option of course.
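To make the fixed-number-of-joins idea concrete, here is a minimal sketch that only builds the SQL text for N chained self-joins. The table and column names (`employees`, `id`, `manager_id`) are made up for illustration, and the helper `fixed_depth_join_sql` is hypothetical, not part of Spark or Databricks. You would pass the resulting string to your SQL engine yourself.

```python
def fixed_depth_join_sql(table: str, id_col: str, parent_col: str, depth: int) -> str:
    """Build a query that walks a parent/child table `depth` levels up
    with plain LEFT JOINs instead of a recursive join."""
    if depth < 1:
        raise ValueError("depth must be at least 1")

    select_cols = [f"t0.{id_col} AS level_0"]
    joins = []
    for level in range(1, depth + 1):
        prev, cur = f"t{level - 1}", f"t{level}"
        select_cols.append(f"{cur}.{id_col} AS level_{level}")
        # Each join climbs one level: previous row's parent = current row's id.
        joins.append(f"LEFT JOIN {table} {cur} ON {prev}.{parent_col} = {cur}.{id_col}")

    return "\n".join(["SELECT " + ", ".join(select_cols), f"FROM {table} t0", *joins])


# Hypothetical table and column names, for illustration only.
query = fixed_depth_join_sql("employees", "id", "manager_id", depth=2)
print(query)
```

This only works if you know an upper bound on the depth of the hierarchy; rows deeper than that bound simply get cut off.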
Using a graph algorithm is also an option.
It is important that you figure out what kind of graph you have, or whether you even have multiple graphs (is it directed, is it fully connected, is it acyclic or not, do you want to visit all edges and vertices, etc.).
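As a rough way to answer some of those questions on a sample of the data, something like the following could work. It is a plain-Python sketch that assumes the edges fit in memory as `(src, dst)` pairs and treats them as directed; the function name `describe_graph` is my own.

```python
from collections import defaultdict, deque


def describe_graph(edges):
    """Report node count, number of separate graphs, and whether cycles exist."""
    successors = defaultdict(set)   # directed adjacency
    neighbours = defaultdict(set)   # adjacency ignoring direction
    in_degree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        if dst not in successors[src]:
            successors[src].add(dst)
            in_degree[dst] += 1
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    # Count weakly connected components with a BFS that ignores direction.
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)

    # Kahn's algorithm: the graph is acyclic only if every node can be removed.
    remaining = {n: in_degree[n] for n in nodes}
    queue = deque(n for n, d in remaining.items() if d == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                queue.append(nxt)

    return {"nodes": len(nodes), "components": components, "acyclic": removed == len(nodes)}


print(describe_graph([("a", "b"), ("b", "c")]))
print(describe_graph([("a", "b"), ("b", "a"), ("x", "y")]))
```

More than one component means you are really dealing with several graphs, and `acyclic: False` means any traversal needs a guard against looping forever.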
Once you have that, you have the choice of either:
- use GraphFrames/GraphX in Spark (not easy to use!)
- use pure Python with some graph processing package (only an option if the amount of data is reasonable)
- use some kind of graph software outside of Databricks
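For the pure Python option, here is a minimal sketch of what the recursive join is usually meant to compute: everything reachable from a starting node, with its depth. It uses only the standard library rather than a graph package, assumes the edge list has been collected to the driver as `(parent, child)` pairs, and the names (`descendants_with_depth`, the sample data) are made up.

```python
from collections import defaultdict, deque


def descendants_with_depth(edges, root):
    """Breadth-first walk from `root`; returns {node: depth below root}."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    depth_of = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            # Skipping already-seen nodes keeps the walk finite on cyclic data.
            if child not in depth_of:
                depth_of[child] = depth_of[node] + 1
                queue.append(child)

    del depth_of[root]  # report only the descendants, not the root itself
    return depth_of


# Hypothetical sample data, including a cycle (dev -> vp).
edges = [("ceo", "vp"), ("vp", "mgr"), ("mgr", "dev"), ("dev", "vp")]
result = descendants_with_depth(edges, "ceo")
print(result)
```

A dedicated graph package offers the same kind of traversal plus a lot more, but the memory limit is the same: the whole edge list has to fit on one machine.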
IIRC there was some talk of introducing Cypher (the Neo4j query language) into Spark or Databricks, but that apparently never happened.