Data Engineering
Migrating Job Orchestration to Shared Compute and avoiding(?) refactoring

Alex_O
New Contributor II

In an effort to migrate our data objects to Unity Catalog, we must migrate our Job Orchestration to leverage Shared Compute in order to interact with the three-level namespace hierarchy.
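For reference, the three-level hierarchy means tables are addressed as catalog.schema.table; a one-line PySpark sketch with hypothetical names:

# Unity Catalog addresses tables as <catalog>.<schema>.<table>
df = spark.table("main.sales.orders")  # hypothetical catalog, schema, and table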

Some of our functions and code references fall outside the feature set supported on Shared Compute, since the majority of our pipelines were built under an Unrestricted policy.

Some examples are Spark DataFrame methods like toJSON() and toDF(), as well as RDD commands used to reach specific elements within a data structure.
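For illustration, a minimal sketch of the kinds of calls involved; the DataFrame and its contents here are hypothetical:

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# DataFrame-to-JSON: toJSON() is backed by the RDD API, which Shared Compute blocks
json_rows = df.toJSON().collect()

# Reaching a specific element through the RDD API is likewise blocked
first_val = df.rdd.map(lambda row: row["val"]).first()

# Rebuilding a DataFrame from an RDD via toDF() hits the same restriction
df2 = df.rdd.toDF(["id", "val"])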

When we try to run these commands, the following error is returned.

 

Py4JError: An error occurred while calling o464.toJson. Trace: py4j.security.Py4JSecurityException: Method public java.lang.String com.databricks.backend.common.rpc.CommandContext.toJson() is not whitelisted on class class com.databricks.backend.common.rpc.CommandContext …

 

My question is two-pronged:

Are there workarounds for whitelisting certain methods, so that all instances of these and other unsupported references can avoid refactoring? While some older online references say yes, everything from roughly October 2023 onward says this is not possible on Shared Compute. Adding the following to the compute's Spark configuration results in a failure to save (other documentation and similar questions confirm this):

 

spark.databricks.pyspark.enablePy4JSecurity false

 

If not, what are people's thoughts on how best to track this code down? In the future state there are presumably best practices to adopt team-wide, but for now we need to identify these snippets of code.

The current proposal is:

  1. Traverse our source control for references to these commonly used methods and code snippets
  2. Make and test the necessary changes
  3. Run the associated pipelines in lower environments
  4. Continue addressing unsupported code until everything has been refactored

My concern with this approach is that there are an unknown number of references to an unknown set of unsupported features. Can anyone think of a way to quantify this, so as to scope out the level of refactoring required? Is there a definitive piece of documentation listing what counts as "unsupported"?
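To try to quantify it before any refactor starts, my current thought is a rough repo scan; a minimal sketch, assuming the repos are checked out locally, and assuming the pattern list below (which is certainly incomplete and will produce some false positives):

import os
import re
from collections import Counter

# Method/API references we believe are unsupported on Shared Compute.
# This pattern list is an assumption: extend it as new failures surface.
PATTERNS = {
    "toJSON": re.compile(r"\.toJSON\s*\("),
    "toDF": re.compile(r"\.toDF\s*\("),
    "rdd": re.compile(r"\.rdd\b"),
    "sparkContext": re.compile(r"\bsparkContext\b"),
}

def scan_repo(root):
    # Count hits per pattern and per file across all Python sources
    per_pattern, per_file = Counter(), Counter()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            for label, pattern in PATTERNS.items():
                hits = len(pattern.findall(text))
                if hits:
                    per_pattern[label] += hits
                    per_file[path] += hits
    return per_pattern, per_file

per_pattern, per_file = scan_repo("path/to/local/checkout")  # hypothetical path
print(per_pattern.most_common())   # which APIs dominate the refactor
print(per_file.most_common(20))    # the 20 files needing the most attention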

 

2 REPLIES

Kaniz
Community Manager

Hi @Alex_O,

  1. Whitelisting Methods on Shared Compute:

    • Unfortunately, directly whitelisting specific methods like “toJSON()” and “toDF()” for Spark DataFrames on Shared Compute is not supported. As of October 2023, Shared Compute environments have restrictions in place to maintain security and stability.
    • The configuration you mentioned, spark.databricks.pyspark.enablePy4JSecurity false, won’t work for Shared Compute clusters. It’s designed for other contexts, but not for Shared Compute.
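Where refactoring is required, the usual direction is to stay on the DataFrame API, which remains fully supported. A minimal sketch of two common substitutions (the DataFrame and column names here are hypothetical):

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Instead of df.toJSON() (which goes through the blocked RDD API),
# build the JSON with column functions:
json_df = df.select(F.to_json(F.struct("*")).alias("json"))

# Instead of df.rdd.map(lambda r: r["val"]).first() to reach an element:
first_val = df.select("val").first()["val"]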

Feel free to reach out if you have further questions or need additional guidance! 🚀

Alex_O
New Contributor II

@Kaniz 

Okay, that makes sense, thank you.

What about the approach to identifying these unsupported methods? Is there any documentation of what is unsupported between Unrestricted and Shared?
