
Delta Lake Spark fails to write _delta_log via a Notebook without granting the Notebook data access

caldempsey
New Contributor

I have set up a Jupyter Notebook with PySpark connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table. I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location when writing directly to a filesystem.

This behaviour seems counterintuitive to me, as I expect the Spark instance to handle data writes independently of the Jupyter Notebook's access to the data. I've put up a repository that reproduces the issue for clarity: https://github.com/caldempsey/docker-notebook-spark-s3
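
For context, the notebook's Spark session is set up roughly like this (a minimal sketch; the master URL is an assumption about the local cluster address, and the two configs are the standard way to enable Delta Lake on Spark):

```python
from pyspark.sql import SparkSession

# Notebook-side session connected to the standalone cluster.
spark = (
    SparkSession.builder
    .appName("delta-write-repro")
    .master("spark://spark-master:7077")  # assumed cluster address
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```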


Steps to reproduce

1. Create a Spark cluster (locally) and connect a Jupyter Notebook to the cluster.
2. Write a Delta table from the notebook to the Spark filesystem (let's say `/out`); a sketch of the write follows this list.
3. Spark will write the Parquet files to `/out`, but errors unless the notebook is given access to `/out`, despite the notebook being registered as an application and not a worker.
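
Step 2 amounts to something like the following (an illustrative sketch; the schema and table name are made up, and only the `format("delta")` write matters):

```python
# Write a small DataFrame as a Delta table to the shared filesystem path.
df = spark.createDataFrame(
    [("rex", "alice"), ("fido", "bob")],
    ["dog", "owner"],
)
df.write.format("delta").mode("overwrite").save("/out/delta_table_of_dog_owners")
```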


Via the repo provided:
1. Clone the repo.
2. Remove the volume mount at [infra-delta-lake/localhost/docker-compose.yml:63](https://github.com/caldempsey/docker-notebook-spark-s3/blob/d17f7963437215346a04450544b85770e3c5ed8f...), `./../../notebook-data-lake/data:/data`, which prevents the notebook from accessing the `/data` target shared with the Spark Master and Workers on their local filesystem.

Observed results

When the notebook has access to `/data` (as a connected application, not a member of the cluster), Delta tables write successfully, including `_delta_log`.

When the notebook does not have access to `/data`, Spark complains that it cannot create `_delta_log`, yet the Parquet data files are still written:

```
Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log
```
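
Listing the table directory from a container that does mount `/data` (a worker or the master) shows the asymmetry; something like:

```python
import os

# Run inside a container where /data is mounted (e.g. a Spark worker).
# After a failed write, expect part-*.parquet files but no _delta_log/ directory.
print(sorted(os.listdir("/data/delta_table_of_dog_owners")))
```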

TL;DR

I expect `_delta_log` to be written regardless of whether the notebook has access to the target filesystem. This is not the case. I can see two reasons why this might happen:

1. The notebook is being used as one of the workers, so writes are also being made from the notebook.

2. There's a bug in Delta Lake's latest version where the PySpark call site needs access to the location the Spark cluster is writing to in order to complete a write of `_delta_log`.

Neither of these really makes sense to me. I've checked (1), and the notebook appears to be registered as an application rather than a worker.
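
One way to check this from the notebook (a minimal sketch, assuming an active SparkSession named `spark`):

```python
# How the notebook relates to the cluster. In client deploy mode the driver
# JVM runs alongside the notebook, while executors run on the workers.
sc = spark.sparkContext
print(sc.master)         # e.g. spark://spark-master:7077
print(sc.deployMode)     # "client" means the driver lives with the notebook
print(sc.applicationId)  # the notebook session is registered as an application
```

Can anyone help?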

