<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Delta Lake Spark fails to write _delta_log via a Notebook without granting the Notebook data access in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delta-lake-spark-fails-to-write-delta-log-via-a-notebook-without/m-p/62481#M31978</link>
<description>&lt;P&gt;I have set up a Jupyter Notebook w/ PySpark connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table. I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location when writing directly to a filesystem.&lt;/P&gt;&lt;P&gt;This behaviour seems counterintuitive to me, as I expect the Spark instance to handle data writes independently of the Jupyter Notebook's access to the data. I've put up a repository reproducing the issue for clarity: &lt;A href="https://github.com/caldempsey/docker-notebook-spark-s3" target="_blank"&gt;https://github.com/caldempsey/docker-notebook-spark-s3&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;Steps to reproduce&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;```&lt;BR /&gt;1. Create a Spark cluster (locally) and connect a Jupyter Notebook to the cluster.&lt;BR /&gt;2. Write a Delta table to the Spark filesystem (let's say `/out`).&lt;BR /&gt;3. Spark will write the Parquet files to `/out` but error unless the notebook is given access to `/out`, despite being registered as an application and not a worker.&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;Via the repo provided:&lt;/STRONG&gt;&lt;BR /&gt;```&lt;BR /&gt;1. Clone the repo.&lt;BR /&gt;2. Remove [infra-delta-lake/localhost/docker-compose.yml:63](&lt;A href="https://github.com/caldempsey/docker-notebook-spark-s3/blob/d17f7963437215346a04450544b85770e3c5ed8f/infra-data-lake/localhost/docker-compose.yml#L62C1-L62C46" target="_blank"&gt;https://github.com/caldempsey/docker-notebook-spark-s3/blob/d17f7963437215346a04450544b85770e3c5ed8f/infra-data-lake/localhost/docker-compose.yml#L62C1-L62C46&lt;/A&gt;) `./../../notebook-data-lake/data:/data`, which prevents the notebook from accessing the `/data` target shared with the Spark master and workers on their local filesystem.&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Observed results&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;When the notebook has access to /data (but is a connected application, not a member of the cluster), Delta tables write successfully with _delta_log. When the notebook does not have access to /data, the Parquet files still get written, but the write fails complaining that it can't create _delta_log:&lt;/P&gt;&lt;P&gt;```&lt;BR /&gt;Py4JJavaError: An error occurred while calling o56.save.&lt;BR /&gt;: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;TLDR&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I expect the `_delta_log` to be written regardless of whether the notebook has access to the target filesystem. This is not the case. I can see two reasons why this might happen:&lt;/P&gt;&lt;P&gt;1. The notebook is being used as part of the workers, and writes are being made from the notebook too.&lt;/P&gt;&lt;P&gt;2. There's a bug in Delta Lake's latest version where the PySpark callsite needs access to the data the Spark cluster is writing to in order to complete a write of `_delta_log`.&lt;/P&gt;&lt;P&gt;Neither of these really makes sense. I've checked 1., and the notebook looks to be registered as an application. Can anyone help?&lt;/P&gt;</description>
    <pubDate>Sat, 02 Mar 2024 12:14:14 GMT</pubDate>
    <dc:creator>caldempsey</dc:creator>
    <dc:date>2024-03-02T12:14:14Z</dc:date>
    <item>
      <title>Delta Lake Spark fails to write _delta_log via a Notebook without granting the Notebook data access</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-lake-spark-fails-to-write-delta-log-via-a-notebook-without/m-p/62481#M31978</link>
<description>&lt;P&gt;I have set up a Jupyter Notebook w/ PySpark connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table. I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location when writing directly to a filesystem.&lt;/P&gt;&lt;P&gt;This behaviour seems counterintuitive to me, as I expect the Spark instance to handle data writes independently of the Jupyter Notebook's access to the data. I've put up a repository reproducing the issue for clarity: &lt;A href="https://github.com/caldempsey/docker-notebook-spark-s3" target="_blank"&gt;https://github.com/caldempsey/docker-notebook-spark-s3&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;Steps to reproduce&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;```&lt;BR /&gt;1. Create a Spark cluster (locally) and connect a Jupyter Notebook to the cluster.&lt;BR /&gt;2. Write a Delta table to the Spark filesystem (let's say `/out`).&lt;BR /&gt;3. Spark will write the Parquet files to `/out` but error unless the notebook is given access to `/out`, despite being registered as an application and not a worker.&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;Via the repo provided:&lt;/STRONG&gt;&lt;BR /&gt;```&lt;BR /&gt;1. Clone the repo.&lt;BR /&gt;2. Remove [infra-delta-lake/localhost/docker-compose.yml:63](&lt;A href="https://github.com/caldempsey/docker-notebook-spark-s3/blob/d17f7963437215346a04450544b85770e3c5ed8f/infra-data-lake/localhost/docker-compose.yml#L62C1-L62C46" target="_blank"&gt;https://github.com/caldempsey/docker-notebook-spark-s3/blob/d17f7963437215346a04450544b85770e3c5ed8f/infra-data-lake/localhost/docker-compose.yml#L62C1-L62C46&lt;/A&gt;) `./../../notebook-data-lake/data:/data`, which prevents the notebook from accessing the `/data` target shared with the Spark master and workers on their local filesystem.&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Observed results&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;When the notebook has access to /data (but is a connected application, not a member of the cluster), Delta tables write successfully with _delta_log. When the notebook does not have access to /data, the Parquet files still get written, but the write fails complaining that it can't create _delta_log:&lt;/P&gt;&lt;P&gt;```&lt;BR /&gt;Py4JJavaError: An error occurred while calling o56.save.&lt;BR /&gt;: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;TLDR&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I expect the `_delta_log` to be written regardless of whether the notebook has access to the target filesystem. This is not the case. I can see two reasons why this might happen:&lt;/P&gt;&lt;P&gt;1. The notebook is being used as part of the workers, and writes are being made from the notebook too.&lt;/P&gt;&lt;P&gt;2. There's a bug in Delta Lake's latest version where the PySpark callsite needs access to the data the Spark cluster is writing to in order to complete a write of `_delta_log`.&lt;/P&gt;&lt;P&gt;Neither of these really makes sense. I've checked 1., and the notebook looks to be registered as an application. Can anyone help?&lt;/P&gt;</description>
      <pubDate>Sat, 02 Mar 2024 12:14:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-lake-spark-fails-to-write-delta-log-via-a-notebook-without/m-p/62481#M31978</guid>
      <dc:creator>caldempsey</dc:creator>
      <dc:date>2024-03-02T12:14:14Z</dc:date>
    </item>
  </channel>
</rss>

