Databricks Community

caldempsey · ‎03-02-2024

I have set up a Jupyter Notebook w/ PySpark connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table.I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location when writing directly to a filesystem.

This behaviour seems counterintuitive to me as I expect the Spark instance to handle data writes independently of the Jupyter Notebook's access to the data. I've put up a repository with the issue for clarity. https://github.com/caldempsey/docker-notebook-spark-s3

Steps to reproduce

```
1. Create a Spark Cluster (locally) and connect a Jupyter Notebook to the Cluster.
2. Write a Delta Table to the Spark filesystem (lets say `/out`)
3. Spark will write the Parquet files to `/out` but error unless the notebook is given access to `/out` despite being registered as an application and not a worker.
```

Via the repo provided:
```
1. Clone the repo
2. Remove [infra-delta-lake/localhost/docker-compose.yml:63](https://github.com/caldempsey/docker-notebook-spark-s3/blob/d17f7963437215346a04450544b85770e3c5ed8f...) `./../../notebook-data-lake/data:/data`, which prevents the notebook from accessing the `/data` target shared with the Spark Master and Workers on their local filesystem.
```

Observed results

When the notebook has access to /data (but is a connected application not a member of the cluster), Delta Tables write successfully with _delta_log.

```
Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log
```

When the notebook does not have access to /data it complains that it can't write _delta_log, but parquet files still get written!

TLDR

I expect the `_delta_log` to be written regardless of whether the Notebook has access to the target filesystem. This is not the case. I can see 2 reasons in my mind why this might happen.

1. The Notebook is being used as part of the workers and writes are being made from the Notebook too.

2. There's a bug in Delta Lake's latest version where the PySpark callsite needs to have access to the data the Spark Cluster is writing to in order to complet a write of `_delta_log`

Both of these don't really make sense. I've checked 1. and the Notebook looks to be registered as an application. Can anyone help?

Databricks Community

Delta Lake Spark fails to write _delta_log via a Notebook without granting the Notebook data access

🌟 Community Pulse: Your Weekly Roundup! July 06 – 12, 2026

Upcoming Community BrickTalk | Sports Analytics: Turning Tracking Data into Real-Time AI Decisions

How to Optimize Your Content for GEO: Best Practices for Writing Discoverable Community Content

Solution Accelerator Series | Building Common Sense Product Recommendations With LLMs

Databricks Community Fellows – June 2026 Recap