<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Running Spark Tests in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/running-spark-tests/m-p/152744#M53866</link>
    <description>&lt;P&gt;Hello Community!&lt;BR /&gt;&lt;BR /&gt;I am writing to ask about the best way to run Spark unit tests in Databricks. Currently we have a set of notebooks that perform operations on the data (joins, merges, etc.).&lt;BR /&gt;Of course, to avoid keeping everything in the notebooks, we have a separate directory for Python functions, which very often use Spark code. For now, the only way to test those functions is to mock Spark, but this is not how we would like to keep it: because we mock the outputs, real Spark behavior is skipped. The problem is that running Spark tests requires a Databricks environment. Another possibly important detail is that we use serverless compute for the calculations, since we do not want to wait for a cluster to wake up.&lt;/P&gt;&lt;P&gt;Do you have any suggestions on how to write good Spark tests and run them in a Databricks environment?&lt;BR /&gt;&lt;BR /&gt;Thanks a lot!&lt;/P&gt;</description>
    <pubDate>Tue, 31 Mar 2026 14:46:56 GMT</pubDate>
    <dc:creator>maikel</dc:creator>
    <dc:date>2026-03-31T14:46:56Z</dc:date>
    <item>
      <title>Running Spark Tests</title>
      <link>https://community.databricks.com/t5/data-engineering/running-spark-tests/m-p/152744#M53866</link>
      <description>&lt;P&gt;Hello Community!&lt;BR /&gt;&lt;BR /&gt;I am writing to ask about the best way to run Spark unit tests in Databricks. Currently we have a set of notebooks that perform operations on the data (joins, merges, etc.).&lt;BR /&gt;Of course, to avoid keeping everything in the notebooks, we have a separate directory for Python functions, which very often use Spark code. For now, the only way to test those functions is to mock Spark, but this is not how we would like to keep it: because we mock the outputs, real Spark behavior is skipped. The problem is that running Spark tests requires a Databricks environment. Another possibly important detail is that we use serverless compute for the calculations, since we do not want to wait for a cluster to wake up.&lt;/P&gt;&lt;P&gt;Do you have any suggestions on how to write good Spark tests and run them in a Databricks environment?&lt;BR /&gt;&lt;BR /&gt;Thanks a lot!&lt;/P&gt;</description>
      <pubDate>Tue, 31 Mar 2026 14:46:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-spark-tests/m-p/152744#M53866</guid>
      <dc:creator>maikel</dc:creator>
      <dc:date>2026-03-31T14:46:56Z</dc:date>
    </item>
    <item>
      <title>Re: Running Spark Tests</title>
      <link>https://community.databricks.com/t5/data-engineering/running-spark-tests/m-p/152751#M53868</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/192995"&gt;@maikel&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;1. Databricks Connect (best fit for your situation)&lt;BR /&gt;This is likely your best path. It lets you run Spark code locally or in CI against real Databricks compute (a cluster or serverless), meaning:&lt;/P&gt;&lt;P&gt;&amp;nbsp;- Real Spark behavior, no mocking&lt;BR /&gt;&amp;nbsp;- Tests run from your local machine or CI pipeline (GitHub Actions, Azure DevOps, etc.)&lt;BR /&gt;&amp;nbsp;- You write standard pytest tests&lt;BR /&gt;&amp;nbsp;- Databricks Connect v2 (DBR 13+) supports serverless compute in recent releases; check the documentation for the minimum version&lt;BR /&gt;Your code and tests run locally, but all actual Spark execution happens on Databricks. No mocking, real Delta, real Unity Catalog.&lt;BR /&gt;Before writing any tests, verify that your connection works, for example by running a trivial query from your local machine.&lt;/P&gt;&lt;P&gt;2. Nutter (Databricks-native notebook testing)&lt;BR /&gt;If your logic is tightly coupled to notebooks, Nutter is a framework by Microsoft built specifically for testing Databricks notebooks. It runs notebooks as tests inside the Databricks environment.&lt;BR /&gt;It is a good fit if you want to test notebook-level behavior, but less clean for unit tests of pure functions.&lt;/P&gt;&lt;P&gt;Regards,&lt;BR /&gt;LR&lt;/P&gt;</description>
      <pubDate>Tue, 31 Mar 2026 15:23:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-spark-tests/m-p/152751#M53868</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2026-03-31T15:23:15Z</dc:date>
    </item>
    <item>
      <title>Re: Running Spark Tests</title>
      <link>https://community.databricks.com/t5/data-engineering/running-spark-tests/m-p/152902#M53892</link>
      <description>&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;Great suggestions regarding Databricks Connect v2,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/24053"&gt;@lingareddy_Alva&lt;/a&gt;!&lt;/P&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/192995"&gt;@maikel&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;A few things to layer on top of that.&lt;/P&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;First, the fact that you already have your functions in a separate directory outside of notebooks is exactly the right foundation. That separation is what makes real testing possible, so you're ahead of a lot of teams on this.&lt;/P&gt;
&lt;OL class="[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3"&gt;
&lt;LI class="whitespace-normal break-words pl-2"&gt;Separate pure Python from Spark-dependent code&lt;/LI&gt;
&lt;/OL&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;This is the highest-leverage move you can make. For your Python functions directory, look at which functions actually need Spark and which don't. Functions that build filter conditions, transform column names, assemble config, or operate on plain Python types can be tested with plain pytest, with no Databricks needed at all. If some of your functions use a local DataFrame library such as Polars, those can also be tested entirely locally.&lt;/P&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;Then have thin adapter functions that translate to/from DataFrames and call that pure logic. This lets you run fast local unit tests for the bulk of your logic and reserve Databricks-backed tests for the parts that truly depend on Spark (joins on DataFrames, Delta reads/writes, Unity Catalog integration).&lt;/P&gt;
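&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;As a minimal sketch of that split, assuming a column-renaming step (the function names normalize_column_name, normalize_column_names, and with_normalized_columns are hypothetical, not from your codebase): the naming rules live in plain Python, and only a thin adapter touches the DataFrame, through DataFrame.columns and DataFrame.toDF.&lt;/P&gt;

```python
import re


def normalize_column_name(name):
    """Pure Python: convert a raw column name to snake_case."""
    # Insert an underscore between a lowercase letter or digit and an uppercase letter.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name.strip())
    # Collapse runs of non-alphanumeric characters into single underscores.
    return re.sub(r"[^0-9a-zA-Z]+", "_", spaced).strip("_").lower()


def normalize_column_names(names):
    """Pure Python: normalize a list of column names, preserving order."""
    return [normalize_column_name(n) for n in names]


def with_normalized_columns(df):
    """Thin adapter: the only function here that touches a Spark DataFrame.

    It relies on DataFrame.columns and DataFrame.toDF, so it needs no
    pyspark import of its own.
    """
    return df.toDF(*normalize_column_names(df.columns))


# Plain assertion, runnable without Spark or Databricks.
assert normalize_column_names(["Order ID", "customerName"]) == ["order_id", "customer_name"]
```

&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;The first two functions can be tested locally in milliseconds; only with_normalized_columns needs a Spark-backed test.&lt;/P&gt;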
&lt;OL class="[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3" start="2"&gt;
&lt;LI class="whitespace-normal break-words pl-2"&gt;Databricks Connect v2 + pytest for Spark integration tests&lt;/LI&gt;
&lt;/OL&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;This is the core recommendation, and it fits your serverless constraint well. With Databricks Connect v2 (DBR 13+; serverless needs a Connect release that supports it, so check the documentation for the minimum version), your tests run locally or in CI, but all Spark execution happens on Databricks serverless. You get the real optimizer, real shuffles, real Delta, and real Unity Catalog, with no mocking.&lt;/P&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;A common pattern is a session-scoped pytest fixture in conftest.py:&lt;/P&gt;
&lt;DIV class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg focus:outline-none focus-visible:ring-2 focus-visible:ring-accent-100" tabindex="0" role="group" aria-label="Code"&gt;
&lt;DIV class="overflow-x-auto"&gt;
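&lt;PRE class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5"&gt;&lt;CODE&gt;&lt;SPAN&gt;# conftest.py
&lt;/SPAN&gt;&lt;SPAN&gt;import pytest
&lt;/SPAN&gt;&lt;SPAN&gt;from databricks.connect import DatabricksSession
&lt;/SPAN&gt;
&lt;SPAN&gt;@pytest.fixture(scope="session")
&lt;/SPAN&gt;&lt;SPAN&gt;def spark():
&lt;/SPAN&gt;&lt;SPAN&gt;    # Assumes databricks-connect is installed and authentication is configured
&lt;/SPAN&gt;&lt;SPAN&gt;    # (for example through a profile in ~/.databrickscfg).
&lt;/SPAN&gt;&lt;SPAN&gt;    # serverless(True) targets serverless compute; it needs a Connect release that supports it.
&lt;/SPAN&gt;&lt;SPAN&gt;    session = DatabricksSession.builder.serverless(True).getOrCreate()
&lt;/SPAN&gt;&lt;SPAN&gt;    yield session
&lt;/SPAN&gt;&lt;SPAN&gt;    session.stop()&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/PRE&gt;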
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;Then your tests create small real DataFrames and assert on outputs:&lt;/P&gt;
&lt;DIV class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg focus:outline-none focus-visible:ring-2 focus-visible:ring-accent-100" tabindex="0" role="group" aria-label="Code"&gt;
&lt;DIV class="overflow-x-auto"&gt;
&lt;PRE class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5"&gt;&lt;CODE&gt;&lt;SPAN&gt;# my_join_fn stands in for your own function; the import path is hypothetical.
&lt;/SPAN&gt;&lt;SPAN&gt;from my_package.joins import my_join_fn
&lt;/SPAN&gt;
&lt;SPAN&gt;def test_join_logic(spark):
&lt;/SPAN&gt;&lt;SPAN&gt;    left = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "val"])
&lt;/SPAN&gt;&lt;SPAN&gt;    right = spark.createDataFrame([("a", 10)], ["key", "val2"])
&lt;/SPAN&gt;
&lt;SPAN&gt;    result = my_join_fn(left, right)
&lt;/SPAN&gt;
&lt;SPAN&gt;    # Left-join semantics are expected: "b" has no match, so val2 is None.
&lt;/SPAN&gt;&lt;SPAN&gt;    rows = {tuple(r) for r in result.collect()}
&lt;/SPAN&gt;&lt;SPAN&gt;    assert rows == {("a", 1, 10), ("b", 2, None)}&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
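&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;One caveat with comparing sets, as in the test above: duplicate rows collapse, so a join that accidentally fans out would still pass. A small helper along these lines (assert_rows_equal and rows_as_multiset are hypothetical names) compares collected rows as multisets instead; it relies only on Row being a tuple subclass. If your environment ships PySpark 3.5 or later, pyspark.testing.assertDataFrameEqual may be an alternative.&lt;/P&gt;

```python
from collections import Counter


def rows_as_multiset(rows):
    """Convert an iterable of rows (Row objects or tuples) to a Counter of tuples."""
    return Counter(tuple(r) for r in rows)


def assert_rows_equal(actual, expected):
    """Compare two row collections, ignoring order but keeping duplicates."""
    actual_counts = rows_as_multiset(actual)
    expected_counts = rows_as_multiset(expected)
    if actual_counts != expected_counts:
        # Counter subtraction keeps only the rows that are over-represented on the left.
        missing = list((expected_counts - actual_counts).elements())
        unexpected = list((actual_counts - expected_counts).elements())
        raise AssertionError(f"missing rows: {missing}; unexpected rows: {unexpected}")


# Usage with plain tuples; in a Spark test, pass result.collect() as `actual`.
assert_rows_equal([("b", 2, None), ("a", 1, 10)], [("a", 1, 10), ("b", 2, None)])
```

&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;Keep the test DataFrames small, since collect() pulls every row back to the machine running pytest.&lt;/P&gt;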
&lt;OL class="[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3" start="3"&gt;
&lt;LI class="whitespace-normal break-words pl-2"&gt;Running tests inside Databricks&lt;/LI&gt;
&lt;/OL&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;Two paths here depending on your setup:&lt;/P&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;Path A (lower barrier): If your code and tests live in Databricks Repos (now called Git folders), you can run pytest directly on compute. Use %pip install pytest in a notebook, then invoke pytest, either with !python -m pytest or in-process with pytest.main(). This gives you real Spark on serverless with zero local setup.&lt;/P&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;Path B (CI/CD ready): Package your Python code and tests as a wheel. Create a Job on serverless compute that runs pytest as the entry point. Wire this into Databricks Asset Bundles (DABs) so your CI pipeline can deploy the wheel, trigger the test job, and gate promotion on test results. This is the more production-grade path and gives you repeatable "run all Spark tests in Databricks env" as part of your pipeline.&lt;/P&gt;
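&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;For either path, the runner can be a few lines of Python. The sketch below assumes pytest is installed on the compute and that your tests live in a directory called tests (both assumptions); build_pytest_args and run_tests are hypothetical helper names. It calls pytest.main() in-process and raises if anything fails, so a notebook cell or job task that calls it fails too.&lt;/P&gt;

```python
import sys


def build_pytest_args(test_path, extra_args=None):
    """Build the argument list handed to pytest.main()."""
    # "-p no:cacheprovider" stops pytest from writing a .pytest_cache directory,
    # which helps when the source tree is read-only.
    args = [test_path, "-v", "-p", "no:cacheprovider"]
    return args + list(extra_args or [])


def run_tests(test_path="tests", extra_args=None):
    """Run pytest in-process and raise if any test fails.

    The default path "tests" is an assumption about the project layout.
    """
    import pytest  # imported lazily so this module loads even without pytest

    sys.dont_write_bytecode = True  # avoid writing .pyc files next to the sources
    exit_code = int(pytest.main(build_pytest_args(test_path, extra_args)))
    if exit_code != 0:
        raise RuntimeError(f"pytest exited with code {exit_code}")
```

&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;In a notebook you could call run_tests("tests") from a cell; in a wheel, run_tests could serve as the entry point function of the test job.&lt;/P&gt;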
&lt;OL class="[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3" start="4"&gt;
&lt;LI class="whitespace-normal break-words pl-2"&gt;Where Nutter fits&lt;/LI&gt;
&lt;/OL&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;Use Nutter only when you need to test whole notebooks: widgets, dbutils calls, orchestration between cells. For testing the Python functions in your separate directory, pytest + Databricks Connect is the cleaner path.&lt;/P&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;The ideal end state for your notebooks is thin orchestration layers that read parameters, load inputs, call your tested functions, and write outputs. The heavy logic lives in your Python modules where pytest can reach it.&lt;/P&gt;
&lt;OL class="[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3" start="5"&gt;
&lt;LI class="whitespace-normal break-words pl-2"&gt;Putting it together&lt;/LI&gt;
&lt;/OL&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;A setup that fits your constraints (serverless, real Spark, minimal mocking):&lt;/P&gt;
&lt;UL class="[li_&amp;amp;]:mb-0 [li_&amp;amp;]:mt-1 [li_&amp;amp;]:gap-1 [&amp;amp;:not(:last-child)_ul]:pb-1 [&amp;amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3"&gt;
&lt;LI class="whitespace-normal break-words pl-2"&gt;Unit tests (fast, local): pure Python logic and any Polars functions. Plain pytest, no Databricks.&lt;/LI&gt;
&lt;LI class="whitespace-normal break-words pl-2"&gt;Spark integration tests (local runner pointing to Databricks serverless): pytest + Databricks Connect, creating small real DataFrames.&lt;/LI&gt;
&lt;LI class="whitespace-normal break-words pl-2"&gt;Optional notebook E2E tests (in workspace): Nutter on serverless Jobs compute for the few cases where you need to test notebook-level behavior.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;This keeps test feedback fast, stays close to real Spark behavior, and fully leverages the serverless environment you already prefer.&lt;/P&gt;
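&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;One way to keep the first two tiers in a single test suite, sketched under assumptions: mark Databricks-backed tests with a custom spark marker (which you would register in your pytest configuration) and skip them unless an opt-in environment variable is set. RUN_SPARK_TESTS is a made-up variable name, not a Databricks setting; pytest_collection_modifyitems is pytest's standard collection hook and belongs in conftest.py.&lt;/P&gt;

```python
import os


def spark_tests_enabled(environ=None):
    """Return True when Spark-backed tests should run.

    RUN_SPARK_TESTS is a hypothetical opt-in variable, not a Databricks setting.
    """
    environ = os.environ if environ is None else environ
    return environ.get("RUN_SPARK_TESTS", "").strip().lower() in {"1", "true", "yes"}


def pytest_collection_modifyitems(config, items):
    """pytest hook: skip tests marked 'spark' unless the opt-in variable is set."""
    if spark_tests_enabled():
        return
    import pytest  # imported lazily; only needed when pytest calls this hook

    skip_spark = pytest.mark.skip(reason="set RUN_SPARK_TESTS=1 to run Spark tests")
    for item in items:
        # get_closest_marker looks at markers only, so the `spark` fixture name
        # does not trigger the skip by itself.
        if item.get_closest_marker("spark") is not None:
            item.add_marker(skip_spark)
```

&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;With this in place, a plain pytest run stays local and fast, while CI can set the variable for the job that has Databricks credentials.&lt;/P&gt;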
&lt;P class="font-claude-response-body break-words whitespace-normal leading-[1.7]"&gt;Hope that helps, Maikel.&lt;/P&gt;
&lt;P class="font-claude-response-body break-words whitespace-pre-wrap leading-[1.7]"&gt;Cheers, Lou&lt;/P&gt;</description>
      <pubDate>Wed, 01 Apr 2026 12:40:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-spark-tests/m-p/152902#M53892</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2026-04-01T12:40:19Z</dc:date>
    </item>
  </channel>
</rss>

