Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Migrating Job Orchestration to Shared Compute and avoiding(?) refactoring

Alex_O
New Contributor II

In an effort to migrate our data objects to Unity Catalog, we must migrate our Job Orchestration to Shared Compute in order to interact with the three-level namespace hierarchy.

We have some functions and code references that fall outside the feature set supported on Shared Compute, as the majority of our pipelines were built under an Unrestricted policy.

Some examples are Spark DataFrame methods like toJSON() and toDF(), as well as RDD commands used to get at specific elements within a data structure.

When trying to run these commands, the following error is returned:

 

Py4JError: An error occurred while calling o464.toJson. Trace: py4j.security.Py4JSecurityException: Method public java.lang.String com.databricks.backend.common.rpc.CommandContext.toJson() is not whitelisted on class class com.databricks.backend.common.rpc.CommandContext …

 

My question is two-pronged:

Are there workarounds for whitelisting certain methods, so that all instances of these and other unsupported references can avoid refactoring? While some older online references say yes, everything from roughly October 2023 onward says this is not possible on Shared Compute. Adding the following to the compute's Spark configuration results in a failure to save (other documentation and similar questions support this):

 

spark.databricks.pyspark.enablePy4JSecurity false

 

If not, what are people's thoughts on how best to get at this information? Future-state, there are presumably best practices to implement team-wide... but for now, we would need to identify these snippets of code.

The current proposal is:

  1. Traverse our source control for references to these commonly used methods/code snippets
  2. Test and address those changes
  3. Run associated pipelines in lower environments
  4. Continue to address unsupported code until all has been refactored

The concern I have with this is that there are an unknown number of references to an unknown amount of unsupported functionality. Can anyone think of a way to quantify this, so as to scope out the level of refactoring required? Is there a definitive piece of documentation listing what is "unsupported"?
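Step 1 of the proposal (traversing source control) can be approximated with a pattern scan to get a rough count before any refactoring starts. A minimal sketch, with the caveat that the pattern list below is an assumption seeded only from the methods mentioned in this thread, not an official list of what Shared Compute disallows:

```python
# Sketch: tally suspected unsupported-API usage across a repo checkout.
import re
from collections import Counter
from pathlib import Path

# Illustrative patterns only; extend this dict as new Py4J errors surface.
PATTERNS = {
    "toJSON": re.compile(r"\.toJSON\s*\("),
    "toDF": re.compile(r"\.toDF\s*\("),
    "rdd_access": re.compile(r"\.rdd\b"),
    "sparkContext": re.compile(r"\bsparkContext\b"),
}

def scan_repo(root: str) -> Counter:
    """Count pattern hits per category over all .py files under root."""
    hits = Counter()
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for name, pattern in PATTERNS.items():
            hits[name] += len(pattern.findall(text))
    return hits
```

Running this over exported notebooks/modules gives a per-category count, which is a reasonable first pass at scoping the refactor even though regex matching will produce some false positives (e.g. unrelated attributes named rdd).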

 

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @Alex_O,

  1. Whitelisting Methods on Shared Compute:

    • Unfortunately, directly whitelisting specific methods like “toJSON()” and “toDF()” for Spark DataFrames on Shared Compute is not supported. As of October 2023, Shared Compute environments have restrictions in place to maintain security and stability.
    • The configuration you mentioned, spark.databricks.pyspark.enablePy4JSecurity false, cannot be applied to Shared Compute clusters; the platform rejects it on save, which is why you see the failure. It is only usable in other compute contexts.

Feel free to reach out if you have further questions or need additional guidance! 🚀

Alex_O
New Contributor II

@Kaniz_Fatma 

Okay, that makes sense, thank you.

What about the approach to identifying these unsupported methods? Is there any documentation of what is unsupported between Unrestricted and Shared?
