Data Engineering
Migrating Job Orchestration to Shared Compute and avoiding(?) refactoring

Alex_O
New Contributor II

In an effort to migrate our data objects to Unity Catalog, we must migrate our Job Orchestration to leverage Shared Compute in order to interact with the three-level namespace hierarchy.
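For reference, the three-level hierarchy means tables are addressed as catalog.schema.table; a one-line PySpark sketch with hypothetical names:

# Unity Catalog addresses tables as <catalog>.<schema>.<table>
df = spark.table("main.sales.orders")  # hypothetical catalog, schema, and table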

Some of our functions and code references fall outside the feature set supported on Shared Compute, since the majority of our pipelines were built under an Unrestricted policy.

Some examples are Spark DataFrame methods like toJSON() and toDF(), as well as RDD commands used to reach specific elements within a data structure.
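For illustration, a minimal sketch of the kinds of calls involved; the DataFrame and its contents here are hypothetical:

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# DataFrame-to-JSON: toJSON() is backed by the RDD API, which Shared Compute blocks
json_rows = df.toJSON().collect()

# Reaching a specific element through the RDD API is likewise blocked
first_val = df.rdd.map(lambda row: row["val"]).first()

# Rebuilding a DataFrame from an RDD via toDF() hits the same restriction
df2 = df.rdd.toDF(["id", "val"])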

When we try to run these commands, the following error is returned.

 

Py4JError: An error occurred while calling o464.toJson. Trace: py4j.security.Py4JSecurityException: Method public java.lang.String com.databricks.backend.common.rpc.CommandContext.toJson() is not whitelisted on class class com.databricks.backend.common.rpc.CommandContext …

 

My question is two-pronged:

Are there workarounds for whitelisting certain methods, so that all instances of these and other unsupported references can avoid refactoring? While some older online references say yes, everything from roughly October 2023 onward says this is not possible on Shared Compute. Adding the following to the compute's Spark configuration results in a failure to save (other documentation and similar questions confirm this):

 

spark.databricks.pyspark.enablePy4JSecurity false

 

If not, what are people's thoughts on how best to track this code down? In the future state there are presumably best practices to adopt team-wide, but for now we need to identify these snippets of code.

The current proposal is:

  1. Traverse our source control for references to these commonly used methods and code snippets
  2. Make and test the necessary changes
  3. Run the associated pipelines in lower environments
  4. Continue addressing unsupported code until everything has been refactored

My concern with this approach is that there are an unknown number of references to an unknown set of unsupported features. Can anyone think of a way to quantify this, so as to scope out the level of refactoring required? Is there a definitive piece of documentation listing what counts as "unsupported"?
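To try to quantify it before any refactor starts, my current thought is a rough repo scan; a minimal sketch, assuming the repos are checked out locally, and assuming the pattern list below (which is certainly incomplete and will produce some false positives):

import os
import re
from collections import Counter

# Method/API references we believe are unsupported on Shared Compute.
# This pattern list is an assumption: extend it as new failures surface.
PATTERNS = {
    "toJSON": re.compile(r"\.toJSON\s*\("),
    "toDF": re.compile(r"\.toDF\s*\("),
    "rdd": re.compile(r"\.rdd\b"),
    "sparkContext": re.compile(r"\bsparkContext\b"),
}

def scan_repo(root):
    # Count hits per pattern and per file across all Python sources
    per_pattern, per_file = Counter(), Counter()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            for label, pattern in PATTERNS.items():
                hits = len(pattern.findall(text))
                if hits:
                    per_pattern[label] += hits
                    per_file[path] += hits
    return per_pattern, per_file

per_pattern, per_file = scan_repo("path/to/local/checkout")  # hypothetical path
print(per_pattern.most_common())   # which APIs dominate the refactor
print(per_file.most_common(20))    # the 20 files needing the most attention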

 

2 REPLIES

Kaniz
Community Manager

Hi @Alex_O,

  1. Whitelisting Methods on Shared Compute:

    • Unfortunately, directly whitelisting specific methods like “toJSON()” and “toDF()” for Spark DataFrames on Shared Compute is not supported. As of October 2023, Shared Compute environments have restrictions in place to maintain security and stability.
    • The configuration you mentioned, spark.databricks.pyspark.enablePy4JSecurity false, won’t work for Shared Compute clusters. It’s designed for other contexts, but not for Shared Compute.
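Where refactoring is required, the usual direction is to stay on the DataFrame API, which remains fully supported. A minimal sketch of two common substitutions (the DataFrame and column names here are hypothetical):

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Instead of df.toJSON() (which goes through the blocked RDD API),
# build the JSON with column functions:
json_df = df.select(F.to_json(F.struct("*")).alias("json"))

# Instead of df.rdd.map(lambda r: r["val"]).first() to reach an element:
first_val = df.select("val").first()["val"]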

Feel free to reach out if you have further questions or need additional guidance! 🚀

Alex_O
New Contributor II

@Kaniz 

Okay, that makes sense, thank you.

What about the approach to identifying these unsupported methods? Is there any documentation of what is unsupported between Unrestricted and Shared?
