I'm seeing this too, but only on my personal cluster with the following config:
- Access Mode: Dedicated
- Policy: Unrestricted
- Runtime info: 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)
If I use the IT department's shared clusters running either 14.3 LTS or 16.2, both of which use Access Mode = Shared, the problem goes away.
The condition for the failure is odd:
- DF1 = pull some info from view "foo"
- DF2 = pull some different info from view "foo"
- DF3 = DF1.unionByName(DF2)
This would not seem to be a "self-join" as described in the error message, other than the fact that the first two DataFrames are different cuts of the same view.