<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Migrating Job Orchestration to Shared Compute and avoiding(?) refactoring in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/migrating-job-orchestration-to-shared-compute-and-avoiding/m-p/61159#M31724</link>
    <description>&lt;P&gt;As part of migrating our data objects to Unity Catalog, we must migrate our job orchestration to Shared Compute in order to interact with the three-level namespace hierarchy.&lt;/P&gt;&lt;P&gt;Some of our code relies on functionality that is not supported on Shared Compute, since the majority of our pipelines were built under an Unrestricted policy.&lt;/P&gt;&lt;P&gt;Examples include Spark DataFrame methods such as "toJSON()" and "toDF()", as well as RDD operations used to reach specific elements within a data structure.&lt;/P&gt;&lt;P&gt;Running these commands returns the following error:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;Py4JError: An error occurred while calling o464.toJson. Trace: py4j.security.Py4JSecurityException: Method public java.lang.String com.databricks.backend.common.rpc.CommandContext.toJson() is not whitelisted on class class com.databricks.backend.common.rpc.CommandContext …&lt;/LI-CODE&gt;&lt;P&gt;My question is two-pronged:&lt;/P&gt;&lt;P&gt;Are there workarounds for whitelisting certain methods so that all instances of these and other unsupported references can avoid refactoring? While some older online references say yes, everything from roughly October 2023 onward says this is not possible on Shared Compute. Adding the following to the compute's Spark configuration results in a failure to save, and other documentation and similar questions confirm this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark.databricks.pyspark.enablePy4JSecurity false&lt;/LI-CODE&gt;&lt;P&gt;If not, how do people suggest getting at this information? In the future state there are presumably best practices to adopt team-wide, but for now we need to identify these snippets of code.&lt;/P&gt;&lt;P&gt;The current proposal is:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Traverse our source control for references to these commonly used methods/code snippets&lt;/LI&gt;&lt;LI&gt;Test and address those changes&lt;/LI&gt;&lt;LI&gt;Run the associated pipelines in lower environments&lt;/LI&gt;&lt;LI&gt;Continue addressing unsupported code until everything has been refactored&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;My concern with this is that there are an unknown number of references to an unknown number of unsupported functions. Can anyone think of a way to quantify this, so we can scope out the level of refactoring required? Is there definitive documentation of what counts as "unsupported"?&lt;/P&gt;</description>
    <pubDate>Mon, 19 Feb 2024 18:03:30 GMT</pubDate>
    <dc:creator>Alex_O</dc:creator>
    <dc:date>2024-02-19T18:03:30Z</dc:date>
    <item>
      <title>Migrating Job Orchestration to Shared Compute and avoiding(?) refactoring</title>
      <link>https://community.databricks.com/t5/data-engineering/migrating-job-orchestration-to-shared-compute-and-avoiding/m-p/61159#M31724</link>
      <description>&lt;P&gt;As part of migrating our data objects to Unity Catalog, we must migrate our job orchestration to Shared Compute in order to interact with the three-level namespace hierarchy.&lt;/P&gt;&lt;P&gt;Some of our code relies on functionality that is not supported on Shared Compute, since the majority of our pipelines were built under an Unrestricted policy.&lt;/P&gt;&lt;P&gt;Examples include Spark DataFrame methods such as "toJSON()" and "toDF()", as well as RDD operations used to reach specific elements within a data structure.&lt;/P&gt;&lt;P&gt;Running these commands returns the following error:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;Py4JError: An error occurred while calling o464.toJson. Trace: py4j.security.Py4JSecurityException: Method public java.lang.String com.databricks.backend.common.rpc.CommandContext.toJson() is not whitelisted on class class com.databricks.backend.common.rpc.CommandContext …&lt;/LI-CODE&gt;&lt;P&gt;My question is two-pronged:&lt;/P&gt;&lt;P&gt;Are there workarounds for whitelisting certain methods so that all instances of these and other unsupported references can avoid refactoring? While some older online references say yes, everything from roughly October 2023 onward says this is not possible on Shared Compute. Adding the following to the compute's Spark configuration results in a failure to save, and other documentation and similar questions confirm this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark.databricks.pyspark.enablePy4JSecurity false&lt;/LI-CODE&gt;&lt;P&gt;If not, how do people suggest getting at this information? In the future state there are presumably best practices to adopt team-wide, but for now we need to identify these snippets of code.&lt;/P&gt;&lt;P&gt;The current proposal is:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Traverse our source control for references to these commonly used methods/code snippets&lt;/LI&gt;&lt;LI&gt;Test and address those changes&lt;/LI&gt;&lt;LI&gt;Run the associated pipelines in lower environments&lt;/LI&gt;&lt;LI&gt;Continue addressing unsupported code until everything has been refactored&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;My concern with this is that there are an unknown number of references to an unknown number of unsupported functions. Can anyone think of a way to quantify this, so we can scope out the level of refactoring required? Is there definitive documentation of what counts as "unsupported"?&lt;/P&gt;</description>
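      One way to put a rough number on step 1 of the proposal above is a static scan of the repositories for the method calls that commonly trigger `Py4JSecurityException` on Shared Compute. This is a minimal sketch: the pattern list below is an assumption seeded from the errors described in the post (`toJSON`, `toDF`, direct RDD access), not an official Databricks list, and it would need to grow as new failures are found in testing.

      ```python
      """Rough scan of a codebase for Spark API calls that often fail on
      Shared (Unity Catalog) compute, to help size a refactoring effort."""
      import re
      from collections import Counter
      from pathlib import Path

      # Assumed starting list, seeded from observed Py4JSecurityException
      # errors; extend it from your own job logs.
      SUSPECT_PATTERNS = {
          "rdd_access": re.compile(r"\.rdd\b"),
          "toJSON": re.compile(r"\.toJSON\s*\("),
          "toDF": re.compile(r"\.toDF\s*\("),
          "sparkContext": re.compile(r"\bsparkContext\b"),
      }

      def scan(repo_root: str) -> Counter:
          """Count suspect-pattern hits, keyed by (pattern name, file path)."""
          hits = Counter()
          for path in Path(repo_root).rglob("*.py"):
              try:
                  text = path.read_text(errors="ignore")
              except OSError:
                  continue  # unreadable file; skip rather than abort the scan
              for name, pattern in SUSPECT_PATTERNS.items():
                  n = len(pattern.findall(text))
                  if n:
                      hits[(name, str(path))] += n
          return hits

      if __name__ == "__main__":
          import sys
          for (name, path), n in sorted(scan(sys.argv[1]).items()):
              print(f"{name}\t{n}\t{path}")
      ```

      A regex scan will over-count (e.g. `.toDF()` on a DataFrame is fine in some contexts) and cannot catch dynamic calls, so treat the output as a scoping estimate per repository, not a definitive refactor list.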
      <pubDate>Mon, 19 Feb 2024 18:03:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/migrating-job-orchestration-to-shared-compute-and-avoiding/m-p/61159#M31724</guid>
      <dc:creator>Alex_O</dc:creator>
      <dc:date>2024-02-19T18:03:30Z</dc:date>
    </item>
    <item>
      <title>Re: Migrating Job Orchestration to Shared Compute and avoiding(?) refactoring</title>
      <link>https://community.databricks.com/t5/data-engineering/migrating-job-orchestration-to-shared-compute-and-avoiding/m-p/61267#M31749</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&lt;/P&gt;&lt;P&gt;Okay, that makes sense, thank you.&lt;/P&gt;&lt;P&gt;What about the approach to identifying these unsupported methods? Is there any documentation of what is unsupported on Shared Compute relative to Unrestricted?&lt;/P&gt;</description>
      <pubDate>Tue, 20 Feb 2024 14:12:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/migrating-job-orchestration-to-shared-compute-and-avoiding/m-p/61267#M31749</guid>
      <dc:creator>Alex_O</dc:creator>
      <dc:date>2024-02-20T14:12:25Z</dc:date>
    </item>
  </channel>
</rss>