Delta Lake Spark fails to write _delta_log via a Notebook without granting the Notebook data access

caldempsey
New Contributor

I have set up a Jupyter Notebook w/ PySpark connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table. I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location when writing directly to a filesystem.

This behaviour seems counterintuitive to me, as I'd expect the Spark instance to handle data writes independently of the Jupyter Notebook's access to the data. I've put up a repository reproducing the issue for clarity: https://github.com/caldempsey/docker-notebook-spark-s3
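
For reference, the write that triggers this is an ordinary Delta write from the notebook's Spark session. A minimal sketch of the kind of code involved (the master URL, Delta package version, and column names here are assumptions, not the repo's exact setup; only the `/data/delta_table_of_dog_owners` path comes from the error shown later):

```python
from pyspark.sql import SparkSession

# Hypothetical session setup: the notebook hosts the Spark driver (client mode)
# and connects to the standalone cluster. Master URL and Delta package version
# are assumptions, not the repo's exact configuration.
spark = (
    SparkSession.builder
    .appName("delta-write-demo")
    .master("spark://spark-master:7077")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Tiny example dataframe; the schema is made up for illustration.
df = spark.createDataFrame([("Ada", "Rex"), ("Bob", "Fido")], ["owner", "dog"])

# Executors write the Parquet data files; the driver commits the _delta_log entry.
df.write.format("delta").mode("overwrite").save("/data/delta_table_of_dog_owners")
```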


Steps to reproduce

```
1. Create a Spark cluster (locally) and connect a Jupyter Notebook to the cluster.
2. Write a Delta table to the Spark filesystem (let's say `/out`).
3. Spark will write the Parquet files to `/out` but error unless the notebook is given access to `/out`, despite the notebook being registered as an application and not a worker.
```


Via the repo provided:
```
1. Clone the repo
2. Remove the volume mapping `./../../notebook-data-lake/data:/data` at [infra-delta-lake/localhost/docker-compose.yml:63](https://github.com/caldempsey/docker-notebook-spark-s3/blob/d17f7963437215346a04450544b85770e3c5ed8f...), which removes the notebook's access to the `/data` target shared with the Spark Master and Workers on their local filesystem (see the compose sketch below).
```
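
For context, that mapping sits in the notebook service's volumes section and looks something like the following (a hedged sketch: the service name, image, and surrounding keys are placeholders; only the `./../../notebook-data-lake/data:/data` mapping is taken from the repo):

```yaml
# Hypothetical excerpt of infra-delta-lake/localhost/docker-compose.yml.
# Only the /data volume mapping is quoted from the repo; the service name
# and other settings are placeholders.
services:
  notebook:
    image: jupyter/pyspark-notebook
    volumes:
      # Removing this line reproduces the failure: the notebook (and therefore
      # the Spark driver it hosts) can no longer see /data.
      - ./../../notebook-data-lake/data:/data
```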

Observed results

When the notebook has access to /data (but is a connected application, not a member of the cluster), Delta tables write successfully with _delta_log.

When the notebook does not have access to /data, the write complains that it can't create _delta_log, but the Parquet files still get written:

```
Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log
```

TLDR

I expect the `_delta_log` to be written regardless of whether the Notebook has access to the target filesystem. This is not the case. I can see two reasons why this might happen:

1. The Notebook is being used as part of the workers, and writes are being made from the Notebook too.

2. There's a bug in Delta Lake's latest version where the PySpark call site needs to have access to the data the Spark cluster is writing to in order to complete a write of `_delta_log`.

Neither of these really makes sense. I've checked 1. and the Notebook appears to be registered as an application. Can anyone help?

1 REPLY

Kaniz
Community Manager

Hi @caldempsey, thank you for providing detailed information about your setup and the issue you're encountering with Spark writes to a Delta table.

Let’s dive into this behavior and explore potential solutions.

  1. Access to Data Location:

    • You’ve correctly observed that when the Jupyter Notebook has access to the /data directory (even as a connected application, not a worker), Delta Tables write successfully with the _delta_log.
    • However, when the notebook lacks access to /data, it complains about not being able to write the _delta_log, even though the Parquet files are still written.
    • This behavior might seem counterintuitive, but it’s related to how Docker volumes and filesystem permissions work.
  2. Docker Volumes and Filesystem Access:

    • In your Docker Compose file, when you define a volume, Docker mounts that host path at a specific path inside the container.
    • If you don't mount a matching volume for the Jupyter notebook (in this case, /data), the notebook container simply can't see that part of the filesystem.
    • The executors on the Spark workers write the Parquet data files, but Delta Lake commits the transaction log (_delta_log) from the driver, and with a notebook connected in client mode the driver runs inside the notebook container, so that container must be able to reach the same /data path. That is why you see data files but no _delta_log.
    • In other words, for operations that read and write data, the Spark cluster and the Jupyter notebook need access to the same filesystem locations; otherwise you get exactly this kind of path error.
  3. Solutions:

    • Mount the same /data path into the notebook container (i.e. keep the `./../../notebook-data-lake/data:/data` volume in the compose file), so the driver in the notebook and the workers resolve identical filesystem paths.
    • Make sure the Delta Lake JAR is available to the Spark session the notebook creates; a hedged configuration sketch follows below.
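
As a minimal sketch of what "including the Delta Lake JAR" can look like from the notebook side, assuming the delta-spark pip package is installed in the notebook environment (the master URL and app name are assumptions, not the repo's exact setup):

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip  # from the delta-spark pip package

# Hypothetical builder; the master URL is an assumption about the local cluster.
builder = (
    SparkSession.builder
    .appName("delta-jar-demo")
    .master("spark://spark-master:7077")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip adds the matching io.delta artifact to
# spark.jars.packages so the driver and executors load the same Delta version.
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```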

In summary, while the Spark workers write the data files, Delta Lake relies on consistent filesystem access across all the components involved, including the driver in the notebook. Addressing the volume configuration or including the Delta Lake JAR should help resolve the issue. If you have any further questions or need additional guidance, feel free to ask! 🚀

 