Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Migrating Job Orchestration to Shared Compute and avoiding(?) refactoring

Alex_O
New Contributor II

As part of migrating our data objects to Unity Catalog, we need to move our job orchestration onto Shared Compute so that it can interact with the three-level namespace hierarchy.

Because the majority of our pipelines were built under an Unrestricted policy, we have functions and code references that fall outside the feature set supported on Shared Compute.

Some examples are Spark DataFrame methods like toJSON() and toDF(), as well as RDD operations used to get at specific elements within a data structure.
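
For concreteness, here is a minimal sketch of the kind of pattern that breaks for us, along with one rewrite that stays on DataFrame-native APIs (this assumes a notebook with an active spark session; the column names are made up):

from pyspark.sql import functions as F

df = spark.range(3).withColumn("label", F.lit("x"))

# RDD-backed call: works under our Unrestricted policy, blocked on Shared Compute
# json_rows = df.toJSON().collect()

# DataFrame-native equivalent using to_json(), which stays on supported APIs
json_rows = [row.js for row in df.select(F.to_json(F.struct("*")).alias("js")).collect()]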

When we try to run these commands, this is the error that comes back:

Py4JError: An error occurred while calling o464.toJson. Trace: py4j.security.Py4JSecurityException: Method public java.lang.String com.databricks.backend.common.rpc.CommandContext.toJson() is not whitelisted on class class com.databricks.backend.common.rpc.CommandContext …

 

My question is two-pronged:

1. Are there workarounds for whitelisting certain methods, so that all instances of these and other unsupported references can avoid refactoring? Some online references say yes, but everything from roughly October 2023 onward says this is not possible on Shared Compute. Adding the following to the compute's Spark configuration results in a failure to save (other documentation and similar questions confirm this):

 

spark.databricks.pyspark.enablePy4JSecurity false

 

2. If not, what are people's thoughts on how best to track this information down? In the future state there are presumably best practices to implement team-wide, but for now we need to identify these snippets of code.

The current proposal is:

  1. Traverse our source control for references to these commonly used methods/code snippets
  2. Test and address those changes
  3. Run associated pipelines in lower environments
  4. Continue to address unsupported code until all has been refactored

The concern I have with this is that there are an unknown number of references to an unknown amount of unsupported functionality. Can anyone think of a way to quantify this, so as to scope out what level of refactoring would be required? Is there a magic piece of documentation listing what is identified as "unsupported"?
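
For illustration, here is the rough kind of scan I have in mind for step 1, just to put numbers on the problem: walk the repo and tally hits against a hand-maintained list of suspect patterns. The pattern list below is a made-up starting point, not an authoritative inventory of what is unsupported.

import re
from collections import Counter
from pathlib import Path

# Hand-maintained list of patterns we believe are unsupported on Shared Compute;
# extend as more are discovered. Illustrative, not exhaustive.
UNSUPPORTED_PATTERNS = {
    "toJSON": re.compile(r"\.toJSON\s*\("),
    "toDF": re.compile(r"\.toDF\s*\("),
    "rdd": re.compile(r"\.rdd\b"),
    "sparkContext": re.compile(r"\bsparkContext\b|\bsc\.parallelize\b"),
}

def scan_repo(root: str) -> Counter:
    """Count occurrences of each suspect pattern across .py files under root."""
    counts = Counter()
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for name, pattern in UNSUPPORTED_PATTERNS.items():
            hits = len(pattern.findall(text))
            if hits:
                counts[name] += hits
                print(f"{path}: {hits} x {name}")
    return counts

if __name__ == "__main__":
    totals = scan_repo(".")  # run from the repo root
    print("\nTotals:", dict(totals))

Even a crude count like this would let us scope the refactor per pipeline before committing to the full migration.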

 

1 REPLY

Alex_O
New Contributor II

@Retired_mod 

Okay, that makes sense, thank you.

What about the approach to identifying these unsupported methods? Is there any documentation of the differences in supported functionality between Unrestricted and Shared Compute?
