<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Testing Spark Declarative Pipeline in Docker Container &gt; PySparkRuntimeError in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/testing-spark-declarative-pipeline-in-docker-container-gt/m-p/144066#M52254</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/96188"&gt;@ChristianRRL&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;In addition to&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/181478"&gt;@osingh&lt;/a&gt;&amp;nbsp;'s answers, check out this old but good blog post about how to structure the pipeline's code to enable a dev and test cycle:&amp;nbsp;&lt;A href="https://www.databricks.com/blog/applying-software-development-devops-best-practices-delta-live-table-pipelines" target="_blank"&gt;https://www.databricks.com/blog/applying-software-development-devops-best-practices-delta-live-table-pipelines&lt;/A&gt;&amp;nbsp;(sections "Structuring the DLT pipeline's code" and "Implementing unit tests")&lt;/P&gt;&lt;P&gt;Best regards,&lt;/P&gt;</description>
    <pubDate>Wed, 14 Jan 2026 16:27:58 GMT</pubDate>
    <dc:creator>aleksandra_ch</dc:creator>
    <dc:date>2026-01-14T16:27:58Z</dc:date>
    <item>
      <title>Testing Spark Declarative Pipeline in Docker Container &gt; PySparkRuntimeError</title>
      <link>https://community.databricks.com/t5/data-engineering/testing-spark-declarative-pipeline-in-docker-container-gt/m-p/143959#M52244</link>
      <description>&lt;P&gt;Hi there, I see via an &lt;A href="https://www.databricks.com/blog/bringing-declarative-pipelines-apache-spark-open-source-project" target="_self"&gt;announcement&lt;/A&gt;&amp;nbsp;last year that Spark Declarative Pipeline (previously DLT) was getting open-sourced into Apache Spark, and I see that this is now true as of Apache Spark 4.1:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://spark.apache.org/docs/4.1.0/declarative-pipelines-programming-guide.html" target="_self"&gt;Spark Declarative Pipelines Programming Guide&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I'm trying to test this out in a Docker container, just to see if/how it's possible to use SDP as a fully standalone tool and help ease vendor lock-in concerns. However, outside of building an SDP pipeline in Databricks, I'm not sure how I'd go about doing this with the open source version. For context, here is the Dockerfile I'm currently using to get the latest version of Apache Spark (4.1.1 at this time).&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;FROM apache/spark:latest

# Switch to root to install packages
USER root

# Install Jupyter, ipykernel, findspark, and pyspark
RUN pip install --no-cache-dir \
    jupyter \
    ipykernel \
    findspark \
    pyspark

# Register the ipykernel
# This ensures "Python 3" is available as a kernel option in the UI
RUN python3 -m ipykernel install --user

ENV SPARK_HOME=/opt/spark
ENV PATH=$SPARK_HOME/bin:$PATH

# Switch back to spark user if desired, or stay root for VS Code access
USER root 

WORKDIR /opt/spark/app

CMD ["/bin/bash"]&lt;/LI-CODE&gt;&lt;P&gt;I'm able to import the pipelines library in pyspark, but when I attempt to use it I quickly get an error:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ChristianRRL_0-1768361209159.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/22961i0655B07219D48256/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ChristianRRL_0-1768361209159.png" alt="ChristianRRL_0-1768361209159.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;ERROR MESSAGE:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class=""&gt;PySparkRuntimeError&lt;/SPAN&gt;&lt;SPAN class=""&gt;: [GRAPH_ELEMENT_DEFINED_OUTSIDE_OF_DECLARATIVE_PIPELINE] APIs that define elements of a declarative pipeline can only be invoked within the context of defining a pipeline.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Any help would be much appreciated to clarify what might be the issue here!&lt;/P&gt;</description>
      <pubDate>Wed, 14 Jan 2026 03:32:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/testing-spark-declarative-pipeline-in-docker-container-gt/m-p/143959#M52244</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2026-01-14T03:32:04Z</dc:date>
    </item>
    <item>
      <title>Re: Testing Spark Declarative Pipeline in Docker Container &gt; PySparkRuntimeError</title>
      <link>https://community.databricks.com/t5/data-engineering/testing-spark-declarative-pipeline-in-docker-container-gt/m-p/144054#M52250</link>
      <description>&lt;P&gt;I think this error is something a lot of people hit when moving from regular PySpark to Spark Declarative Pipelines in &lt;STRONG&gt;Spark 4.1.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I believe the main reason it shows up is that SDP doesn’t work like normal PySpark, where you can run things cell by cell in a notebook. It’s declarative in nature. You define the entire pipeline first, and then the SDP runtime comes in and executes the whole graph for you.&lt;/P&gt;&lt;P&gt;The steps below may help resolve the issue.&lt;/P&gt;&lt;P&gt;Solution: Moving from Interactive to Pipeline Mode&lt;BR /&gt;The error GRAPH_ELEMENT_DEFINED_OUTSIDE_OF_DECLARATIVE_PIPELINE happens because the @sdp.table decorators are trying to register themselves into a "Pipeline Context" that doesn't exist in a standard PySpark session.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Step 1: Update your Dockerfile&lt;/STRONG&gt;&lt;BR /&gt;You need the specific CLI tools for pipelines. Change your pip install line to include the pipelines extra:&lt;/P&gt;&lt;P&gt;To fix this in your Docker environment, you need to change how you execute the code.&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;Step 2: Use the Pipeline Structure&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;Instead of running a script directly, create a small project structure. SDP requires a YAML file to tell Spark where your code is.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;1. Create a spark-pipeline.yml file:&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;name: "my_docker_pipeline"&lt;BR /&gt;storage: "/tmp/checkpoints" # Required for streaming state&lt;BR /&gt;libraries:&lt;BR /&gt;- glob: "transformations/*.py"&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;2. 
Put your logic in transformations/my_task.py:&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;from pyspark import pipelines as sdp&lt;BR /&gt;&lt;BR /&gt;@sdp.table(name="raw_data")&lt;BR /&gt;def raw_data():&lt;BR /&gt;    return spark.read.format("csv").load("/opt/spark/app/data.csv")&lt;BR /&gt;&lt;BR /&gt;@sdp.materialized_view(name="clean_data")&lt;BR /&gt;def clean_data():&lt;BR /&gt;    return sdp.read("raw_data").filter("id IS NOT NULL")&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Step 3: Run using the CLI&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Inside your container, don't use python my_script.py; use the spark-pipelines command that comes with the package:&lt;/P&gt;&lt;PRE&gt;spark-pipelines run --spec spark-pipeline.yml&lt;/PRE&gt;&lt;P&gt;Why this works:&lt;BR /&gt;When you use spark-pipelines run, the tool initializes the &lt;EM&gt;"Declarative Context"&lt;/EM&gt; first. It then scans your files, builds a dependency graph of your &lt;EM&gt;@sdp&lt;/EM&gt; decorators, and handles the &lt;EM&gt;spark.read&lt;/EM&gt; and &lt;EM&gt;.write&lt;/EM&gt; operations automatically.&lt;/P&gt;&lt;P&gt;Hope this works!&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":crossed_fingers:"&gt;🤞&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Thanks!&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 14 Jan 2026 15:24:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/testing-spark-declarative-pipeline-in-docker-container-gt/m-p/144054#M52250</guid>
      <dc:creator>osingh</dc:creator>
      <dc:date>2026-01-14T15:24:09Z</dc:date>
    </item>
    <item>
      <title>Re: Testing Spark Declarative Pipeline in Docker Container &gt; PySparkRuntimeError</title>
      <link>https://community.databricks.com/t5/data-engineering/testing-spark-declarative-pipeline-in-docker-container-gt/m-p/144059#M52251</link>
      <description>&lt;P&gt;This is great info! A couple of quick follow-up questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Can I get some assistance with identifying what CLI tools would need to be included in the Dockerfile? I see you put the following comment; I'm not sure if you meant to highlight the CLI tools, but it seems cut off:&lt;/LI&gt;&lt;/UL&gt;&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/181478"&gt;@osingh&lt;/a&gt;&amp;nbsp;wrote:&lt;P&gt;&lt;STRONG&gt;Step 1: Update your Dockerfile&lt;/STRONG&gt;&lt;BR /&gt;You need the specific CLI tools for pipelines. Change your pip install line to include the pipelines extra:&lt;/P&gt;&lt;P&gt;To fix this in your Docker environment, you need to change how you execute the code.&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;UL&gt;&lt;LI&gt;Also, I understand that due to the declarative nature of SDP, the current setup will not run cell-by-cell. But I'm wondering, how would this normally be tested if cell-by-cell is not testable? Within the SDP context, how would devs commonly test SDP functionality without needing to write the entire pipeline and pray that it works (I'm sure this shouldn't be the case)?&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Wed, 14 Jan 2026 15:40:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/testing-spark-declarative-pipeline-in-docker-container-gt/m-p/144059#M52251</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2026-01-14T15:40:22Z</dc:date>
    </item>
    <item>
      <title>Re: Testing Spark Declarative Pipeline in Docker Container &gt; PySparkRuntimeError</title>
      <link>https://community.databricks.com/t5/data-engineering/testing-spark-declarative-pipeline-in-docker-container-gt/m-p/144063#M52253</link>
      <description>&lt;P&gt;You just need the &lt;FONT face="courier new,courier"&gt;spark-pipelines&lt;/FONT&gt; CLI that comes with PySpark itself.&lt;/P&gt;&lt;P&gt;In the Dockerfile, install PySpark with the pipelines extra:&lt;/P&gt;&lt;PRE&gt;RUN pip install --no-cache-dir "pyspark[pipelines]"&lt;/PRE&gt;&lt;P&gt;This installs the &lt;FONT face="courier new,courier"&gt;spark-pipelines&lt;/FONT&gt; CLI, which is required to run Spark Declarative Pipelines. Without this, SDP code won’t work, even if PySpark is already installed.&lt;/P&gt;&lt;P&gt;On your second question:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;You can use the "Logic vs. Wrapper" Pattern&lt;/STRONG&gt;&lt;BR /&gt;The most effective way to test is to keep your actual business logic (the transformations) in standard PySpark functions that have no decorators attached to them.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Step 1&lt;/STRONG&gt;: Write a &lt;FONT face="courier new,courier"&gt;"pure"&lt;/FONT&gt; function in a separate &lt;FONT face="courier new,courier"&gt;&lt;STRONG&gt;.py&lt;/STRONG&gt;&lt;/FONT&gt; file that takes a DataFrame and returns a DataFrame.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Step 2&lt;/STRONG&gt;: Test that function in a standard Jupyter cell or a pytest script using a small, manually created dummy DataFrame.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Step 3&lt;/STRONG&gt;: Once you know the logic works, simply wrap it in an @&lt;FONT face="courier new,courier"&gt;sdp.table&lt;/FONT&gt; function in your pipeline file.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;You can also use the &lt;FONT face="courier new,courier"&gt;dry-run&lt;/FONT&gt; Command&lt;/STRONG&gt;&lt;/P&gt;</description>
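[Editor's note] The "Logic vs. Wrapper" pattern above can be sketched in plain Python. This is a hypothetical illustration: clean_data is a made-up transformation, and FakeDataFrame is a stand-in used here only so the snippet runs without a Spark session. In a real pytest, Step 2 would pass a small DataFrame built with spark.createDataFrame instead.

```python
# Step 1: the "pure" function -- DataFrame in, DataFrame out,
# no sdp decorators, so it needs no pipeline context to run.
def clean_data(df):
    """Drop rows with a NULL id."""
    return df.filter("id IS NOT NULL")

# Step 2: unit-test it with a tiny dummy input. FakeDataFrame is a
# hypothetical stub standing in for spark.createDataFrame(...) output.
class FakeDataFrame:
    def __init__(self, rows):
        self.rows = rows

    def filter(self, condition):
        # Mimic the one SQL condition this test exercises.
        assert condition == "id IS NOT NULL"
        return FakeDataFrame([r for r in self.rows if r["id"] is not None])

out = clean_data(FakeDataFrame([{"id": 1}, {"id": None}]))
print(len(out.rows))  # prints 1
```

Step 3 would then be a thin wrapper in the pipeline file: an @sdp.table-decorated function that reads its input and simply returns clean_data(df).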
      <pubDate>Wed, 14 Jan 2026 16:16:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/testing-spark-declarative-pipeline-in-docker-container-gt/m-p/144063#M52253</guid>
      <dc:creator>osingh</dc:creator>
      <dc:date>2026-01-14T16:16:39Z</dc:date>
    </item>
    <item>
      <title>Re: Testing Spark Declarative Pipeline in Docker Container &gt; PySparkRuntimeError</title>
      <link>https://community.databricks.com/t5/data-engineering/testing-spark-declarative-pipeline-in-docker-container-gt/m-p/144066#M52254</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/96188"&gt;@ChristianRRL&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;In addition to&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/181478"&gt;@osingh&lt;/a&gt;&amp;nbsp;'s answers, check out this old but good blog post about how to structure the pipeline's code to enable a dev and test cycle:&amp;nbsp;&lt;A href="https://www.databricks.com/blog/applying-software-development-devops-best-practices-delta-live-table-pipelines" target="_blank"&gt;https://www.databricks.com/blog/applying-software-development-devops-best-practices-delta-live-table-pipelines&lt;/A&gt;&amp;nbsp;(sections "Structuring the DLT pipeline's code" and "Implementing unit tests")&lt;/P&gt;&lt;P&gt;Best regards,&lt;/P&gt;</description>
      <pubDate>Wed, 14 Jan 2026 16:27:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/testing-spark-declarative-pipeline-in-docker-container-gt/m-p/144066#M52254</guid>
      <dc:creator>aleksandra_ch</dc:creator>
      <dc:date>2026-01-14T16:27:58Z</dc:date>
    </item>
  </channel>
</rss>

