How can I implement a custom graph algorithm on top of Apache Spark? Can I use GraphX or GraphFrames?

Joseph_B
New Contributor III

1 REPLY

Joseph_B
New Contributor III

You can implement custom algorithms for GraphFrames using either the Scala/Java or Python APIs. GraphFrames provides some structures to simplify writing graph algorithms; the three primary options are as follows, with the best options listed first:

  • Pregel: This is the most standard graph algorithm framework. If you have a custom algorithm to implement, it's possible someone has already written a Pregel-style implementation of it. (A minimal Pregel sketch follows this list.)
  • AggregateMessages: This is a set of APIs for iterative message-passing algorithms. On each iteration, each vertex can pass messages to its neighbors and aggregate incoming messages. Messages are specified using Apache Spark DataFrame syntax: a message is actually a Column. (An aggregateMessages sketch also follows this list.) Examples:
    • The user guide has a simple example showing one iteration.
    • The codebase has a more complex example showing an implementation of Belief Propagation. Note the use of getCachedDataFrame, which is important for algorithms that run for many iterations.
  • DataFrames: Since a GraphFrame is essentially two DataFrames (one of vertices and one of edges), you can implement algorithms against those DataFrames directly. This allows the most customization but is generally the most work. (A DataFrame-level sketch follows this list as well.)
    • The codebase has a fairly optimized example in Connected Components: it runs local (per-worker task) iterations alongside global iterations to speed convergence and reduce the total number of global iterations.
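
To make the Pregel option concrete, here is a minimal sketch of PageRank on a toy graph, following the pattern in the GraphFrames documentation. It assumes GraphFrames 0.8.0+ (which added the DataFrame-based Pregel API) and a running SparkSession named spark; the graph data is made up for illustration:

```python
from graphframes import GraphFrame
from graphframes.lib import Pregel
from pyspark.sql import functions as F

# Toy graph: vertices need an "id" column; edges need "src" and "dst".
vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame(
    [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")], ["src", "dst"])

# Pregel expressions can only reference vertex columns, so precompute
# out-degrees and attach them to the vertices.
g = GraphFrame(vertices, edges)
vertices = vertices.join(g.outDegrees, "id", "left").fillna(0, subset=["outDegree"])
g = GraphFrame(vertices, edges)

n = vertices.count()
alpha = 0.15

# Each superstep: every vertex sends rank/outDegree to its out-neighbors,
# messages are summed per vertex, and the "rank" column is updated.
ranks = (
    g.pregel
     .setMaxIter(5)
     .withVertexColumn(
         "rank",
         F.lit(1.0 / n),                                    # initial value
         F.coalesce(Pregel.msg(), F.lit(0.0)) * F.lit(1 - alpha)
             + F.lit(alpha / n))                            # update rule
     .sendMsgToDst(Pregel.src("rank") / Pregel.src("outDegree"))
     .aggMsgs(F.sum(Pregel.msg()))
     .run()
)
ranks.show()
```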
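
Here is a similarly minimal aggregateMessages sketch, adapted from the one-iteration example in the user guide. The "age" column is made up for illustration; a multi-iteration algorithm would additionally wrap intermediate results with AggregateMessages.getCachedDataFrame, as the Belief Propagation example does:

```python
from graphframes import GraphFrame
from graphframes.lib import AggregateMessages as AM
from pyspark.sql import functions as F

vertices = spark.createDataFrame(
    [("a", 34), ("b", 36), ("c", 30)], ["id", "age"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])
g = GraphFrame(vertices, edges)

# One round of message passing: each edge sends the destination's age to its
# source and the source's age to its destination; each vertex then sums the
# messages it received. Note that each message is just a Column.
agg = g.aggregateMessages(
    F.sum(AM.msg).alias("summedAges"),   # how each vertex aggregates messages
    sendToSrc=AM.dst["age"],             # message sent along each edge to src
    sendToDst=AM.src["age"])             # message sent along each edge to dst
agg.show()
```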
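
And for the pure-DataFrame option, a sketch of one frontier-expansion step (the building block of a BFS-style traversal), reusing the GraphFrame g from the previous sketch; the start vertex "a" is arbitrary:

```python
from pyspark.sql import functions as F

# g.vertices and g.edges are ordinary DataFrames, so graph steps are joins.
frontier = g.vertices.filter(F.col("id") == "a").select("id")

# Follow every edge leaving the current frontier to find its neighbors.
next_frontier = (
    g.edges.join(frontier, g.edges["src"] == frontier["id"])
           .select(F.col("dst").alias("id"))
           .distinct()
)
next_frontier.show()

# A full algorithm would loop this join, tracking visited vertices and
# checkpointing periodically so the query plan stays bounded.
```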

When implementing algorithms in GraphFrames, you'll need to be familiar with Apache Spark DataFrames: columns, operations on them, and debugging their error messages. Just remember that errors you get in GraphFrames can generally be reproduced via the underlying DataFrame operations, so you can debug with DataFrames (one abstraction lower).
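
As a contrived illustration of that debugging path (the column name below is deliberately wrong):

```python
# Inspecting the underlying DataFrames is often the quickest first step.
g.vertices.printSchema()
g.edges.printSchema()

# An unresolved-column error inside a GraphFrames call, e.g.
#   g.aggregateMessages(F.sum(AM.msg).alias("s"), sendToDst=AM.src["no_such_col"])
# reproduces one abstraction lower with the same AnalysisException:
#   g.vertices.select("no_such_col")
```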

You can also implement algorithms using the analogous APIs in GraphX and RDDs. In general, that is less effective nowadays: the main reason to fall back to these RDD-based APIs is to gain more low-level control over data locality and data movement across the cluster, but you give up a lot of the benefits of modern Spark.
