You can implement custom algorithms for GraphFrames using either the Scala/Java or the Python APIs. GraphFrames provides some structures to simplify writing graph algorithms; the three primary options are as follows, with the best options first:
- Pregel: This is the most standard graph algorithm framework. If you have a custom algorithm to implement, it's possible someone has already written a Pregel-style implementation of it.
- AggregateMessages: This is a set of APIs for iterative message-passing algorithms. On each iteration, each vertex can pass messages to its neighbors and aggregate incoming messages. Messages are specified using Apache Spark DataFrame syntax: a message is actually a Column. Examples:
  - The user guide has a simple example showing a single iteration.
  - The codebase has a more complex example showing an implementation of Belief Propagation. Note the use of getCachedDataFrame, which is important for algorithms that run for many iterations.
- DataFrames: Since a GraphFrame is essentially two DataFrames (one DataFrame of vertices and one of edges), you can implement algorithms using those DataFrames directly. This allows the most customization but is generally the most work.
  - The codebase has a complex example in Connected Components, which is a fairly optimized algorithm. It runs local (per-worker task) and global iterations to speed up convergence and reduce the total number of global iterations.
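To make the message-passing model behind the first two options concrete, here is a minimal sketch in plain Python. It deliberately does not use Spark or the GraphFrames API: the function name `propagate_min_label` and the toy graph are hypothetical, chosen only for illustration. On each iteration every vertex sends its current label to its neighbors, and each vertex aggregates its incoming messages with `min`. In GraphFrames the messages would instead be Column expressions and the aggregation a DataFrame aggregate.

```python
from collections import defaultdict


def propagate_min_label(vertices, edges, max_iter=20):
    """Label each vertex with the smallest vertex id in its (undirected) component.

    Hypothetical helper for illustration only; not part of GraphFrames.
    """
    labels = {v: v for v in vertices}
    for _ in range(max_iter):
        # Send phase: every edge carries each endpoint's label to the other endpoint.
        inbox = defaultdict(list)
        for src, dst in edges:
            inbox[dst].append(labels[src])
            inbox[src].append(labels[dst])
        # Aggregate phase: each vertex keeps the minimum of its own label
        # and all labels it received this iteration.
        new_labels = {v: min([labels[v]] + inbox[v]) for v in vertices}
        if new_labels == labels:
            break  # No label changed, so the propagation has converged.
        labels = new_labels
    return labels


vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
components = propagate_min_label(vertices, edges)
print(components)
```

The loop has the same shape as an iterative AggregateMessages algorithm: build messages from the current vertex state, aggregate them per vertex, update the state, and repeat until nothing changes. This naive version is only a teaching aid and makes no attempt at the optimizations mentioned for Connected Components.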
When implementing algorithms in GraphFrames, you'll need to be familiar with Apache Spark DataFrames: columns, operations on them, and debugging error messages. Just remember that errors you get in GraphFrames can generally be reproduced via the underlying DataFrame operations, so you can debug with DataFrames (one abstraction lower).
You can also implement algorithms using the analogous APIs in GraphX and RDDs. In general, that route is less effective nowadays. The main reason to fall back to these RDD-based APIs is to have more low-level control over data locality and movement across the cluster, but you give up a lot of the benefits of modern Spark, such as the query optimization that DataFrame operations receive.