Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Self-joins are blocked on remote tables

chris_y_1e
New Contributor II

In our production Databricks workflow, we have been getting this error since yesterday in one of the steps:

org.apache.spark.SparkException: Self-joins are blocked on remote tables

We haven't changed our workflow or modified any Databricks configuration. Is there any reason why we are getting this error?

5 REPLIES

cgrant
Databricks Employee

If you're using dedicated compute, please be aware that

Self-joins are blocked by default when data filtering is called, but you can allow them by setting spark.databricks.remoteFiltering.blockSelfJoins to false on compute you are running these commands on.

Before you enable self-joins on a dedicated compute resource, be aware that a self-join query handled by the data filtering capability could return different snapshots of the same remote table.

From the documentation here.
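For reference, a flag like this is typically applied in the cluster's Spark config on the compute running the commands; a minimal sketch based on the flag named above (apply it only after weighing the snapshot caveat):

```
spark.databricks.remoteFiltering.blockSelfJoins false
```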

chris_y_1e
New Contributor II

Setting `blockSelfJoins` to false didn't work: a self-join SQL query that originally ran in seconds took over an hour to complete. Why was the behaviour of the access mode changed so drastically across all our Databricks instances on Azure on 5 Mar? It was working perfectly in all regions before then.

 

TomRenish
New Contributor III

I'm seeing this too, but only on my personal cluster with the following config:

  • Access Mode: Dedicated
  • Policy: Unrestricted
  • Runtime info: 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)

If I use the IT department's shared clusters running either 14.3 LTS or 16.2, both of which are running Access Mode = Shared, then the problem goes away.

Condition for the failure is odd:

  • DF1 = pull some info from VIEW "foo"
  • DF2 = pull some different info from VIEW "foo"
  • DF3 = DF1.unionByName(DF2)

This would not seem to be a "self-join" as described in the error message, other than the fact that the first two dataframes are different cuts of the same view.

TomRenish
New Contributor III

@chris_y_1e  Take a look at your cluster config to see if you're tripping over the same condition I did. See my comment above.

chris_y_1e
New Contributor II

@TomRenish Yeah, we fixed it by switching the job to shared compute, which is called "USER_ISOLATION" in the `job.yaml` file:

data_security_mode: USER_ISOLATION
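For context, a sketch of where that key sits in a job cluster spec; only `data_security_mode` comes from this thread, and the remaining fields are illustrative placeholders:

```yaml
# Hypothetical job.yaml excerpt; spark_version, node_type_id, and
# num_workers are placeholder values, not taken from the thread.
job_clusters:
  - job_cluster_key: main
    new_cluster:
      spark_version: 15.4.x-scala2.12
      node_type_id: Standard_DS3_v2
      num_workers: 2
      data_security_mode: USER_ISOLATION   # shared access mode
```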
