<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Using pyspark databricks UDFs with outside function imports in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/using-pyspark-databricks-udfs-with-outside-function-imports/m-p/138556#M50955</link>
    <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The core issue is that PySpark UDFs require their entire closure—including any helper functions they call, such as _get_greeting—to be serializable and available on the worker nodes. In Databricks Workspace, the module distribution and packaging are managed transparently. Locally with databricks-connect, however, the worker processes may not have access to your full Python environment or custom modules, which causes pickling or module import errors like:&lt;/P&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;ModuleNotFoundError: No module named 'module'&lt;/P&gt;
&lt;H2 id="why-this-happens" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;Why This Happens&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Single-file/inline UDFs work&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;because all definitions are in the main context.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;UDFs that reference nested functions or imports&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;depend on either closure serialization (which has quirks with PySpark/CloudPickle) or require module visibility from the execution directory of the PySpark worker (often different from your local directory).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Databricks Workspace&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;handles code distribution, so helpers within the same file or package are available.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Local/Databricks-Connect&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;runs worker processes that may not have your project's structure in their PYTHONPATH, so import/module issues arise for non-trivial UDFs.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 id="recommendations-for-local-debugging" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;Recommendations for Local Debugging&lt;/H2&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;1. Flatten UDF Definitions&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Avoid calling helpers inside UDFs unless you are sure the helpers will always be in the worker's PYTHONPATH.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Inline logic directly into the UDF for local testing.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;2. Package Your Code&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Package your Python code (as a .whl or .egg) and install it in your local Spark environment before running your script. Example:&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;bash&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;python setup.py bdist_wheel
pip &lt;SPAN class="token token"&gt;install&lt;/SPAN&gt; dist/your_package.whl
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Start your Spark app only after your local cluster (or databricks-connect environment) has the same package installed, ensuring all workers find your module.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;3. Use Spark's addPyFile&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;You can distribute Python files or zip archives to Spark workers with:&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;spark&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;sparkContext&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;addPyFile&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"path/to/your/module.zip"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;This works for small, self-contained modules.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;4. Use pandas_udf for Easier Debugging&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Often, pandas_udf (vectorized UDFs) are easier to debug and more performant, and they behave more like regular Python functions.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;5. Conditional Logic for Local Mode&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Use a local-mode fallback: implement logic to bypass UDFs for local or unit testing:&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;&lt;SPAN class="token token"&gt;if&lt;/SPAN&gt; local_use&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
    df &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; df&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;withColumn&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"message"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; some_non_udf_function&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"name"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;SPAN class="token token"&gt;else&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
    df &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; df&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;withColumn&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"message"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; send_complex_message&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"name"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;However, this sacrifices fidelity with production code.&lt;/P&gt;
&lt;H2 id="table-udf-best-practices-for-local-debugging" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;Table: UDF Best Practices for Local Debugging&lt;/H2&gt;
&lt;DIV class="group relative"&gt;
&lt;DIV class="w-full overflow-x-auto md:max-w-[90vw] border-subtlest ring-subtlest divide-subtlest bg-transparent"&gt;
&lt;TABLE class="border-subtler my-[1em] w-full table-auto border-separate border-spacing-0 border-l border-t"&gt;
&lt;THEAD class="bg-subtler"&gt;
&lt;TR&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Pattern&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Works in Databricks Workspace&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Works with databricks-connect Locally&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Notes&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Single-file/Inline UDF&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;All logic is visible to workers&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;UDF with helper import&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;No (unless packaged/distributed)&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Workers may miss helper functions&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Packaged module UDF&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes (if installed in env)&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Package must be visible to Spark workers&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;addPyFile UDF&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Useful for small scripts/modules&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Conditional local mode&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;N/A&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Non-UDF path only for local debugging&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;/DIV&gt;
&lt;DIV class="bg-base border-subtler shadow-subtle pointer-coarse:opacity-100 right-xs absolute bottom-0 flex rounded-lg border opacity-0 transition-opacity group-hover:opacity-100 [&amp;amp;&amp;gt;*:not(:first-child)]:border-subtle [&amp;amp;&amp;gt;*:not(:first-child)]:border-l"&gt;
&lt;DIV class="flex"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="flex"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;H2 id="key-takeaway" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;Key Takeaway&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;To reliably debug PySpark UDFs locally, package all helper functions/methods into installable modules and ensure your Spark worker (whether local or remote) has access to that package. Otherwise, keep UDFs purely self-contained.&lt;/P&gt;
&lt;HR /&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;For Databricks official references and troubleshooting serialization/UDF local development, see the following resources:&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;A class="reset interactable cursor-pointer decoration-1 underline-offset-1 text-super hover:underline font-semibold" href="https://docs.databricks.com/en/dev-tools/databricks-connect.html#limitations" target="_blank" rel="nofollow noopener"&gt;&lt;SPAN class="text-box-trim-both"&gt;Databricks UDF Troubleshooting&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;A class="reset interactable cursor-pointer decoration-1 underline-offset-1 text-super hover:underline font-semibold" href="https://spark.apache.org/docs/latest/api/python/user_guide/python-packaging.html" target="_blank" rel="nofollow noopener"&gt;&lt;SPAN class="text-box-trim-both"&gt;PySpark: How to Use addPyFile&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Tue, 11 Nov 2025 10:36:34 GMT</pubDate>
    <dc:creator>mark_ott</dc:creator>
    <dc:date>2025-11-11T10:36:34Z</dc:date>
    <item>
      <title>Using pyspark databricks UDFs with outside function imports</title>
      <link>https://community.databricks.com/t5/data-engineering/using-pyspark-databricks-udfs-with-outside-function-imports/m-p/111865#M44017</link>
      <description>&lt;DIV class=""&gt;&lt;H2&gt;Problem with minimal example&lt;/H2&gt;&lt;P&gt;The below minimal example does not run locally with databricks-connect==15.3 but does run within databricks workspace.&lt;/P&gt;&lt;P&gt;main.py&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;from&lt;/SPAN&gt; databricks.connect &lt;SPAN class=""&gt;import&lt;/SPAN&gt; DatabricksSession

&lt;SPAN class=""&gt;from&lt;/SPAN&gt; module.udf &lt;SPAN class=""&gt;import&lt;/SPAN&gt; send_message, send_complex_message

spark = DatabricksSession.builder.getOrCreate()


&lt;SPAN class=""&gt;def&lt;/SPAN&gt; &lt;SPAN class=""&gt;main_works&lt;/SPAN&gt;():
    df = spark.createDataFrame([
        (&lt;SPAN class=""&gt;1&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"James"&lt;/SPAN&gt;),
        (&lt;SPAN class=""&gt;2&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"John"&lt;/SPAN&gt;),
    ], [&lt;SPAN class=""&gt;"id"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"name"&lt;/SPAN&gt;])

    df = df.withColumn(&lt;SPAN class=""&gt;"message"&lt;/SPAN&gt;, send_message(&lt;SPAN class=""&gt;"name"&lt;/SPAN&gt;))

    df.show()


&lt;SPAN class=""&gt;def&lt;/SPAN&gt; &lt;SPAN class=""&gt;main_fails&lt;/SPAN&gt;():
    df = spark.createDataFrame([
        (&lt;SPAN class=""&gt;1&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"James"&lt;/SPAN&gt;),
        (&lt;SPAN class=""&gt;2&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"John"&lt;/SPAN&gt;),
    ], [&lt;SPAN class=""&gt;"id"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"name"&lt;/SPAN&gt;])

    df = df.withColumn(&lt;SPAN class=""&gt;"message"&lt;/SPAN&gt;, send_complex_message(&lt;SPAN class=""&gt;"name"&lt;/SPAN&gt;))

    df.show()

&lt;SPAN class=""&gt;if&lt;/SPAN&gt; __name__ == &lt;SPAN class=""&gt;"__main__"&lt;/SPAN&gt;:
    main_works()
    main_fails()&lt;/PRE&gt;&lt;P&gt;udf.py&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;from&lt;/SPAN&gt; pyspark.sql.functions &lt;SPAN class=""&gt;import&lt;/SPAN&gt; udf
&lt;SPAN class=""&gt;from&lt;/SPAN&gt; pyspark.sql.types &lt;SPAN class=""&gt;import&lt;/SPAN&gt; StringType

GREETING = &lt;SPAN class=""&gt;"Hi"&lt;/SPAN&gt;
COWBOY_GREETING = &lt;SPAN class=""&gt;"Howdy"&lt;/SPAN&gt;

&lt;SPAN class=""&gt;def&lt;/SPAN&gt; &lt;SPAN class=""&gt;_get_greeting&lt;/SPAN&gt;() -&amp;gt; &lt;SPAN class=""&gt;str&lt;/SPAN&gt;:
    &lt;SPAN class=""&gt;return&lt;/SPAN&gt; COWBOY_GREETING

&lt;SPAN class=""&gt;@udf(&lt;SPAN class=""&gt;returnType=StringType()&lt;/SPAN&gt;)&lt;/SPAN&gt;
&lt;SPAN class=""&gt;def&lt;/SPAN&gt; &lt;SPAN class=""&gt;send_message&lt;/SPAN&gt;(&lt;SPAN class=""&gt;message: &lt;SPAN class=""&gt;str&lt;/SPAN&gt;&lt;/SPAN&gt;) -&amp;gt; &lt;SPAN class=""&gt;str&lt;/SPAN&gt;:
    &lt;SPAN class=""&gt;return&lt;/SPAN&gt; &lt;SPAN class=""&gt;f"&lt;SPAN class=""&gt;{GREETING}&lt;/SPAN&gt;, &lt;SPAN class=""&gt;{message}&lt;/SPAN&gt;!"&lt;/SPAN&gt;

&lt;SPAN class=""&gt;@udf(&lt;SPAN class=""&gt;returnType=StringType()&lt;/SPAN&gt;)&lt;/SPAN&gt;
&lt;SPAN class=""&gt;def&lt;/SPAN&gt; &lt;SPAN class=""&gt;send_complex_message&lt;/SPAN&gt;(&lt;SPAN class=""&gt;message: &lt;SPAN class=""&gt;str&lt;/SPAN&gt;&lt;/SPAN&gt;) -&amp;gt; &lt;SPAN class=""&gt;str&lt;/SPAN&gt;:
    greeting_str = _get_greeting()
    &lt;SPAN class=""&gt;return&lt;/SPAN&gt; &lt;SPAN class=""&gt;f"&lt;SPAN class=""&gt;{greeting_str}&lt;/SPAN&gt;, &lt;SPAN class=""&gt;{message}&lt;/SPAN&gt;!"&lt;/SPAN&gt;
&lt;/PRE&gt;&lt;H2&gt;Traceback when running locally&lt;/H2&gt;&lt;P&gt;The second function fails:&lt;/P&gt;&lt;PRE&gt;Traceback (most recent call last):
  File &lt;SPAN class=""&gt;"main.py"&lt;/SPAN&gt;, line &lt;SPAN class=""&gt;31&lt;/SPAN&gt;, &lt;SPAN class=""&gt;in&lt;/SPAN&gt; &amp;lt;module&amp;gt;
    main_fails()
  File &lt;SPAN class=""&gt;"main.py"&lt;/SPAN&gt;, line &lt;SPAN class=""&gt;27&lt;/SPAN&gt;, &lt;SPAN class=""&gt;in&lt;/SPAN&gt; main_fails
    df.show()
  [shortened]
  File &lt;SPAN class=""&gt;"/databricks/spark/python/pyspark/serializers.py"&lt;/SPAN&gt;, line &lt;SPAN class=""&gt;572&lt;/SPAN&gt;, &lt;SPAN class=""&gt;in&lt;/SPAN&gt; loads
    &lt;SPAN class=""&gt;return&lt;/SPAN&gt; cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named &lt;SPAN class=""&gt;'module'&lt;/SPAN&gt;
&lt;/PRE&gt;&lt;H2&gt;Expected result and question&lt;/H2&gt;&lt;P&gt;See below. How do you properly debug your Databricks workflows if they cannot allow for some UDF function nesting? It's very hard to debug them. Now we develop code for&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;local-use == True&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;which works around UDFs.&lt;/P&gt;&lt;PRE&gt;+---+-----+----------+
| &lt;SPAN class=""&gt;id&lt;/SPAN&gt;| name|   message|
+---+-----+----------+
|  &lt;SPAN class=""&gt;1&lt;/SPAN&gt;|James|Hi, James!|
|  &lt;SPAN class=""&gt;2&lt;/SPAN&gt;| John| Hi, John!|
+---+-----+----------+&lt;/PRE&gt;&lt;P&gt;and&lt;/P&gt;&lt;PRE&gt;+---+-----+-------------+
| &lt;SPAN class=""&gt;id&lt;/SPAN&gt;| name|      message|
+---+-----+-------------+
|  &lt;SPAN class=""&gt;1&lt;/SPAN&gt;|James|Howdy, James!|
|  &lt;SPAN class=""&gt;2&lt;/SPAN&gt;| John| Howdy, John!|
+---+-----+-------------+&lt;/PRE&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 05 Mar 2025 20:59:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/using-pyspark-databricks-udfs-with-outside-function-imports/m-p/111865#M44017</guid>
      <dc:creator>tp992</dc:creator>
      <dc:date>2025-03-05T20:59:27Z</dc:date>
    </item>
    <item>
      <title>Re: Using pyspark databricks UDFs with outside function imports</title>
      <link>https://community.databricks.com/t5/data-engineering/using-pyspark-databricks-udfs-with-outside-function-imports/m-p/112044#M44086</link>
      <description>&lt;P&gt;I think the solution is in .addArtifact if I read this:&lt;/P&gt;&lt;DIV class=""&gt;&lt;A href="https://kb.databricks.com/en_US/clusters/cannot-access-apache-sparkcontext-object-using-addpyfile" target="_blank" rel="noopener"&gt;https://kb.databricks.com/en_US/clusters/cannot-access-apache-sparkcontext-object-using-addpyfile&lt;/A&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;A href="https://www.databricks.com/blog/python-dependency-management-spark-connect" target="_blank" rel="noopener"&gt;https://www.databricks.com/blog/python-dependency-management-spark-connect&lt;/A&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;P&gt;But have not gotten it to work just yet.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;spark &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; DatabricksSession.builder.getOrCreate()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;venv_pack.pack(&lt;/SPAN&gt;&lt;SPAN&gt;output&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;'pyspark_venv.tar.gz'&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;spark.addArtifact(&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"pyspark_venv.tar.gz#environment"&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;archive&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;True&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;spark.conf.set(&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"spark.sql.execution.pyspark.python"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"environment/bin/python"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;spark.addArtifact(&lt;/SPAN&gt;&lt;SPAN&gt;"minimal_example_1/module/udf.py"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;pyfile&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;True&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;```&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 07 Mar 2025 19:53:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/using-pyspark-databricks-udfs-with-outside-function-imports/m-p/112044#M44086</guid>
      <dc:creator>tp992</dc:creator>
      <dc:date>2025-03-07T19:53:55Z</dc:date>
    </item>
    <item>
      <title>Re: Using pyspark databricks UDFs with outside function imports</title>
      <link>https://community.databricks.com/t5/data-engineering/using-pyspark-databricks-udfs-with-outside-function-imports/m-p/138556#M50955</link>
      <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The core issue is that PySpark UDFs require their entire closure—including any helper functions they call, such as _get_greeting—to be serializable and available on the worker nodes. In Databricks Workspace, the module distribution and packaging are managed transparently. Locally with databricks-connect, however, the worker processes may not have access to your full Python environment or custom modules, which causes pickling or module import errors like:&lt;/P&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;ModuleNotFoundError: No module named 'module'&lt;/P&gt;
&lt;H2 id="why-this-happens" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;Why This Happens&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Single-file/inline UDFs work&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;because all definitions are in the main context.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;UDFs that reference nested functions or imports&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;depend on either closure serialization (which has quirks with PySpark/CloudPickle) or require module visibility from the execution directory of the PySpark worker (often different from your local directory).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Databricks Workspace&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;handles code distribution, so helpers within the same file or package are available.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Local/Databricks-Connect&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;runs worker processes that may not have your project's structure in their PYTHONPATH, so import/module issues arise for non-trivial UDFs.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 id="recommendations-for-local-debugging" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;Recommendations for Local Debugging&lt;/H2&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;1. Flatten UDF Definitions&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Avoid calling helpers inside UDFs unless you are sure the helpers will always be in the worker's PYTHONPATH.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Inline logic directly into the UDF for local testing.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;2. Package Your Code&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Package your Python code (as a .whl or .egg) and install it in your local Spark environment before running your script. Example:&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;bash&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;python setup.py bdist_wheel
pip &lt;SPAN class="token token"&gt;install&lt;/SPAN&gt; dist/your_package.whl
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Start your Spark app only after your local cluster (or databricks-connect environment) has the same package installed, ensuring all workers find your module.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;3. Use Spark's addPyFile&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;You can distribute Python files or zip archives to Spark workers with:&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;spark&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;sparkContext&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;addPyFile&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"path/to/your/module.zip"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;This works for small, self-contained modules.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;4. Use pandas_udf for Easier Debugging&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Often, pandas_udf (vectorized UDFs) are easier to debug and more performant, and they behave more like regular Python functions.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;5. Conditional Logic for Local Mode&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Use a local-mode fallback: implement logic to bypass UDFs for local or unit testing:&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;python&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;&lt;SPAN class="token token"&gt;if&lt;/SPAN&gt; local_use&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
    df &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; df&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;withColumn&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"message"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; some_non_udf_function&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"name"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;SPAN class="token token"&gt;else&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;:&lt;/SPAN&gt;
    df &lt;SPAN class="token token operator"&gt;=&lt;/SPAN&gt; df&lt;SPAN class="token token punctuation"&gt;.&lt;/SPAN&gt;withColumn&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"message"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;,&lt;/SPAN&gt; send_complex_message&lt;SPAN class="token token punctuation"&gt;(&lt;/SPAN&gt;&lt;SPAN class="token token"&gt;"name"&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;&lt;SPAN class="token token punctuation"&gt;)&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;However, this sacrifices fidelity with production code.&lt;/P&gt;
&lt;H2 id="table-udf-best-practices-for-local-debugging" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;Table: UDF Best Practices for Local Debugging&lt;/H2&gt;
&lt;DIV class="group relative"&gt;
&lt;DIV class="w-full overflow-x-auto md:max-w-[90vw] border-subtlest ring-subtlest divide-subtlest bg-transparent"&gt;
&lt;TABLE class="border-subtler my-[1em] w-full table-auto border-separate border-spacing-0 border-l border-t"&gt;
&lt;THEAD class="bg-subtler"&gt;
&lt;TR&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Pattern&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Works in Databricks Workspace&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Works with databricks-connect Locally&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Notes&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Single-file/Inline UDF&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;All logic is visible to workers&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;UDF with helper import&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;No (unless packaged/distributed)&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Workers may miss helper functions&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Packaged module UDF&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes (if installed in env)&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Package must be visible to Spark workers&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;addPyFile UDF&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Useful for small scripts/modules&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Conditional local mode&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;N/A&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Yes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Non-UDF path only for local debugging&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;/DIV&gt;
&lt;DIV class="bg-base border-subtler shadow-subtle pointer-coarse:opacity-100 right-xs absolute bottom-0 flex rounded-lg border opacity-0 transition-opacity group-hover:opacity-100 [&amp;amp;&amp;gt;*:not(:first-child)]:border-subtle [&amp;amp;&amp;gt;*:not(:first-child)]:border-l"&gt;
&lt;DIV class="flex"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="flex"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;H2 id="key-takeaway" class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0 md:text-lg [hr+&amp;amp;]:mt-4"&gt;Key Takeaway&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;To reliably debug PySpark UDFs locally, package all helper functions/methods into installable modules and ensure your Spark worker (whether local or remote) has access to that package. Otherwise, keep UDFs purely self-contained.&lt;/P&gt;
&lt;HR /&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;For Databricks official references and troubleshooting serialization/UDF local development, see the following resources:&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;A class="reset interactable cursor-pointer decoration-1 underline-offset-1 text-super hover:underline font-semibold" href="https://docs.databricks.com/en/dev-tools/databricks-connect.html#limitations" target="_blank" rel="nofollow noopener"&gt;&lt;SPAN class="text-box-trim-both"&gt;Databricks UDF Troubleshooting&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;A class="reset interactable cursor-pointer decoration-1 underline-offset-1 text-super hover:underline font-semibold" href="https://spark.apache.org/docs/latest/api/python/user_guide/python-packaging.html" target="_blank" rel="nofollow noopener"&gt;&lt;SPAN class="text-box-trim-both"&gt;PySpark: How to Use addPyFile&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 11 Nov 2025 10:36:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/using-pyspark-databricks-udfs-with-outside-function-imports/m-p/138556#M50955</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-11-11T10:36:34Z</dc:date>
    </item>
  </channel>
</rss>

