Self-joins are blocked on remote tables
03-06-2025 01:53 AM
In our production Databricks workflow, one of the steps has been failing since yesterday with this error:
org.apache.spark.SparkException: Self-joins are blocked on remote tables
We haven't changed our workflow or any Databricks configuration. Is there a reason we are suddenly getting this error?
03-07-2025 08:58 AM
If you're using dedicated compute, please be aware that self-joins are blocked by default when data filtering is invoked, but you can allow them by setting `spark.databricks.remoteFiltering.blockSelfJoins` to `false` on the compute you are running these commands on.
Before you enable self-joins on a dedicated compute resource, be aware that a self-join query handled by the data filtering capability could return different snapshots of the same remote table.
From the documentation here
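For reference, a cluster-level Spark conf like this is typically entered under the cluster's Advanced options > Spark config, one `key value` pair per line. This is a sketch of the usual pattern for Spark confs, not something specific to this feature; the key itself is the one quoted above:

```
spark.databricks.remoteFiltering.blockSelfJoins false
```

It can also usually be set at runtime in a notebook with `spark.conf.set("spark.databricks.remoteFiltering.blockSelfJoins", "false")`, though note the snapshot caveat before relying on it.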
03-07-2025 09:18 AM
Setting `blockSelfJoins` to false didn't work: a self-join SQL query that originally ran in seconds now takes over an hour to complete. Why was the behaviour of the access mode changed so drastically across all our Azure Databricks instances on 5 Mar? It was working perfectly in all regions before then.
3 weeks ago
I'm seeing this too, but only on my personal cluster with the following config:
- Access Mode: Dedicated
- Policy: Unrestricted
- Runtime: 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)
If I use the IT department's shared clusters running either 14.3 LTS or 16.2, both with Access Mode = Shared, the problem goes away.
The condition for the failure is odd:
- DF1 = pull some info from VIEW "foo"
- DF2 = pull some different info from VIEW "foo"
- DF3 = DF1.unionByName(DF2)
This would not seem to be a "self-join" as described in the error message, other than the fact that the first two dataframes are different cuts of the same view.
3 weeks ago
@chris_y_1e take a look at your cluster config to see if you're tripping up on the same condition I was. See my comment, above.
3 weeks ago
@TomRenish Yeah, we fixed it by switching to shared compute, which is called "USER_ISOLATION" in the `job.yaml` file:
data_security_mode: USER_ISOLATION
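In case it helps anyone else, here is roughly where that key sits in a job cluster definition. Everything except `data_security_mode: USER_ISOLATION` is illustrative (cluster key, node type, worker count, and runtime version are placeholders you'd replace with your own values):

```yaml
job_clusters:
  - job_cluster_key: main          # hypothetical cluster key
    new_cluster:
      spark_version: 14.3.x-scala2.12   # example runtime
      node_type_id: Standard_DS3_v2     # example Azure node type
      num_workers: 2
      data_security_mode: USER_ISOLATION  # shared (standard) access mode
```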

