Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I implement a custom graph algorithm on top of Apache Spark? Can I use GraphX or GraphFrames?

Joseph_B
Databricks Employee

Joseph_B
Databricks Employee

You can implement custom algorithms for GraphFrames using either the Scala/Java or Python APIs. GraphFrames provides structures that simplify writing graph algorithms; the three primary options are as follows, with the best options listed first:

  • Pregel: This is the most standard graph algorithm framework. If you have a custom algorithm to implement, it's possible someone has already written a Pregel-style implementation of it that you can adapt.
  • AggregateMessages: This is a set of APIs for iterative message-passing algorithms. On each iteration, each vertex can pass messages to its neighbors and aggregate incoming messages. Messages are specified using Apache Spark DataFrame syntax: a message is actually a Column. Examples:
    • The user guide has a simple example showing a single iteration.
    • The codebase has a more complex example showing an implementation of Belief Propagation. Note the use of getCachedDataFrame, which is important for algorithms that run for many iterations.
  • DataFrames: Since a GraphFrame is essentially two DataFrames (one DataFrame of vertices and one of edges), you can implement algorithms against those DataFrames directly. This allows the most customization but is generally the most work.
    • The codebase has a complex example in Connected Components, a fairly optimized algorithm: it runs local (per-worker task) iterations as well as global iterations to speed convergence and reduce the total number of global iterations.

When implementing algorithms in GraphFrames, you'll need to be familiar with Apache Spark DataFrames: columns, operations on them, and debugging error messages. Just remember that errors you get in GraphFrames can generally be reproduced via the underlying DataFrame operations, so you can debug with DataFrames (one abstraction lower).

You can also implement algorithms using analogous APIs in GraphX and RDDs. In general, that is less effective nowadays. The main reason to fall back to these RDD-based APIs is to have more low-level control over data locality and movement across the cluster, but you give up a lot of the benefits of modern Spark.
