Delta Lake Spark fails to write _delta_log via a Notebook without granting the Notebook data access

caldempsey
New Contributor

I have set up a Jupyter Notebook w/ PySpark connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table. I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location when writing directly to a filesystem.

This behaviour seems counterintuitive to me, as I'd expect the Spark instance to handle data writes independently of the Jupyter Notebook's access to the data. I've put up a repository reproducing the issue for clarity: https://github.com/caldempsey/docker-notebook-spark-s3
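
For reference, the write that triggers this is an ordinary Delta write from the notebook's Spark session. A minimal sketch of the kind of code involved (the master URL, Delta package version, and column names here are assumptions, not the repo's exact setup; only the `/data/delta_table_of_dog_owners` path comes from the error shown later):

```python
from pyspark.sql import SparkSession

# Hypothetical session setup: the notebook hosts the Spark driver (client mode)
# and connects to the standalone cluster. Master URL and Delta package version
# are assumptions, not the repo's exact configuration.
spark = (
    SparkSession.builder
    .appName("delta-write-demo")
    .master("spark://spark-master:7077")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Tiny example dataframe; the schema is made up for illustration.
df = spark.createDataFrame([("Ada", "Rex"), ("Bob", "Fido")], ["owner", "dog"])

# Executors write the Parquet data files; the driver commits the _delta_log entry.
df.write.format("delta").mode("overwrite").save("/data/delta_table_of_dog_owners")
```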


Steps to reproduce

```
1. Create a Spark cluster (locally) and connect a Jupyter Notebook to the cluster.
2. Write a Delta table to the Spark filesystem (let's say `/out`).
3. Spark will write the Parquet files to `/out` but error unless the notebook is given access to `/out`, despite the notebook being registered as an application and not a worker.
```


Via the repo provided:
```
1. Clone the repo
2. Remove the volume mapping `./../../notebook-data-lake/data:/data` at [infra-delta-lake/localhost/docker-compose.yml:63](https://github.com/caldempsey/docker-notebook-spark-s3/blob/d17f7963437215346a04450544b85770e3c5ed8f...), which removes the notebook's access to the `/data` target shared with the Spark Master and Workers on their local filesystem (see the compose sketch below).
```
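
For context, that mapping sits in the notebook service's volumes section and looks something like the following (a hedged sketch: the service name, image, and surrounding keys are placeholders; only the `./../../notebook-data-lake/data:/data` mapping is taken from the repo):

```yaml
# Hypothetical excerpt of infra-delta-lake/localhost/docker-compose.yml.
# Only the /data volume mapping is quoted from the repo; the service name
# and other settings are placeholders.
services:
  notebook:
    image: jupyter/pyspark-notebook
    volumes:
      # Removing this line reproduces the failure: the notebook (and therefore
      # the Spark driver it hosts) can no longer see /data.
      - ./../../notebook-data-lake/data:/data
```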

Observed results

When the notebook has access to /data (but is a connected application, not a member of the cluster), Delta tables write successfully with _delta_log.

When the notebook does not have access to /data, the write complains that it can't create _delta_log, but the Parquet files still get written:

```
Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log
```

TLDR

I expect the `_delta_log` to be written regardless of whether the Notebook has access to the target filesystem. This is not the case. I can see two reasons why this might happen:

1. The Notebook is being used as part of the workers, and writes are being made from the Notebook too.

2. There's a bug in Delta Lake's latest version where the PySpark call site needs to have access to the data the Spark cluster is writing to in order to complete a write of `_delta_log`.

Neither of these really makes sense. I've checked 1. and the Notebook appears to be registered as an application. Can anyone help?

1 REPLY

Kaniz
Community Manager

Hi @caldempsey, thank you for providing detailed information about your setup and the issue you're encountering with Spark writes to a Delta table.

Let’s dive into this behavior and explore potential solutions.

  1. Access to Data Location:

    • You’ve correctly observed that when the Jupyter Notebook has access to the /data directory (even as a connected application, not a worker), Delta Tables write successfully with the _delta_log.
    • However, when the notebook lacks access to /data, it complains about not being able to write the _delta_log, even though the Parquet files are still written.
    • This behavior might seem counterintuitive, but it’s related to how Docker volumes and filesystem permissions work.
  2. Docker Volumes and Filesystem Access:

    • In your Docker Compose file, when you define a volume, Docker mounts that host path at a specific path inside the container.
    • If you don't mount a matching volume for the Jupyter notebook (in this case, /data), the notebook container simply can't see that part of the filesystem.
    • The executors on the Spark workers write the Parquet data files, but Delta Lake commits the transaction log (_delta_log) from the driver, and with a notebook connected in client mode the driver runs inside the notebook container, so that container must be able to reach the same /data path. That is why you see data files but no _delta_log.
    • In other words, for operations that read and write data, the Spark cluster and the Jupyter notebook need access to the same filesystem locations; otherwise you get exactly this kind of path error.
  3. Solutions:

    • Mount the same /data path into the notebook container (i.e. keep the `./../../notebook-data-lake/data:/data` volume in the compose file), so the driver in the notebook and the workers resolve identical filesystem paths.
    • Make sure the Delta Lake JAR is available to the Spark session the notebook creates; a hedged configuration sketch follows below.
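
As a minimal sketch of what "including the Delta Lake JAR" can look like from the notebook side, assuming the delta-spark pip package is installed in the notebook environment (the master URL and app name are assumptions, not the repo's exact setup):

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip  # from the delta-spark pip package

# Hypothetical builder; the master URL is an assumption about the local cluster.
builder = (
    SparkSession.builder
    .appName("delta-jar-demo")
    .master("spark://spark-master:7077")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip adds the matching io.delta artifact to
# spark.jars.packages so the driver and executors load the same Delta version.
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```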

In summary, while the Spark workers write the data files, Delta Lake relies on consistent filesystem access across all the components involved, including the driver in the notebook. Addressing the volume configuration or including the Delta Lake JAR should help resolve the issue. If you have any further questions or need additional guidance, feel free to ask! 🚀

 