<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Python udfs, Spark Connect, included modules. Compatibility issues with shared compute in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/python-udfs-spark-connect-included-modules-compatibility-issues/m-p/78167#M35482</link>
    <description>&lt;P&gt;Our current system uses Databricks notebooks, and we have some shared notebooks that define Python UDFs. This worked for us until we tried to switch from single-user clusters to shared clusters. Shared clusters and serverless compute now use Spark Connect, which introduces a lot of behavior changes. There are several ways to include resources in Spark, but we can't find a combination that allows the worker nodes to find the UDF source code. Pulling the UDF function directly into our notebook works, but we want to keep our code modular. Copying all of these functions into each notebook is a lot of bloat.&lt;/P&gt;&lt;P&gt;In our top-level notebook we append the subfolder that contains our scrubber UDFs to the Python path:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os
import sys
sys.path.append(os.path.abspath('./scrubbers/'))
from UDFRegistry import UDFRegistry&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The scrubber functions are configurable per tenant, so they are registered dynamically using:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def get_known_udf(self, module_name, udf_function_name):
    # Import the scrubber module by name, then look up the UDF on it
    udf_module = __import__(module_name)
    udf_function = getattr(udf_module, udf_function_name)
    return udf_function&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As far as we can tell, the modules are found when running on a single-user cluster, but the modules containing the UDFs can't be found on shared compute or a serverless cluster.&lt;/P&gt;&lt;P&gt;Does anyone have a better way to include UDFs from library modules in subfolders? We could move our code to .py files and create a wheel package, but that adds a ton of complexity and introduces a second versioning system. It's much cleaner for everything to pull from one tagged version branch in our repo.&lt;/P&gt;</description>
    <pubDate>Wed, 10 Jul 2024 18:54:01 GMT</pubDate>
    <dc:creator>thackman</dc:creator>
    <dc:date>2024-07-10T18:54:01Z</dc:date>
    <item>
      <title>Python udfs, Spark Connect, included modules. Compatibility issues with shared compute</title>
      <link>https://community.databricks.com/t5/data-engineering/python-udfs-spark-connect-included-modules-compatibility-issues/m-p/78167#M35482</link>
      <description>&lt;P&gt;Our current system uses Databricks notebooks, and we have some shared notebooks that define Python UDFs. This worked for us until we tried to switch from single-user clusters to shared clusters. Shared clusters and serverless compute now use Spark Connect, which introduces a lot of behavior changes. There are several ways to include resources in Spark, but we can't find a combination that allows the worker nodes to find the UDF source code. Pulling the UDF function directly into our notebook works, but we want to keep our code modular. Copying all of these functions into each notebook is a lot of bloat.&lt;/P&gt;&lt;P&gt;In our top-level notebook we append the subfolder that contains our scrubber UDFs to the Python path:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os
import sys
sys.path.append(os.path.abspath('./scrubbers/'))
from UDFRegistry import UDFRegistry&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The scrubber functions are configurable per tenant, so they are registered dynamically using:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def get_known_udf(self, module_name, udf_function_name):
    # Import the scrubber module by name, then look up the UDF on it
    udf_module = __import__(module_name)
    udf_function = getattr(udf_module, udf_function_name)
    return udf_function&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As far as we can tell, the modules are found when running on a single-user cluster, but the modules containing the UDFs can't be found on shared compute or a serverless cluster.&lt;/P&gt;&lt;P&gt;Does anyone have a better way to include UDFs from library modules in subfolders? We could move our code to .py files and create a wheel package, but that adds a ton of complexity and introduces a second versioning system. It's much cleaner for everything to pull from one tagged version branch in our repo.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jul 2024 18:54:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-udfs-spark-connect-included-modules-compatibility-issues/m-p/78167#M35482</guid>
      <dc:creator>thackman</dc:creator>
      <dc:date>2024-07-10T18:54:01Z</dc:date>
    </item>
    <item>
      <title>Re: Python udfs, Spark Connect, included modules. Compatibility issues with shared compute</title>
      <link>https://community.databricks.com/t5/data-engineering/python-udfs-spark-connect-included-modules-compatibility-issues/m-p/78874#M35629</link>
      <description>&lt;P&gt;I'm not sure what you mean by "&lt;SPAN&gt;Ensure the Python binary's location is correctly set to resolve runtime issues". We aren't using any binaries; everything is just Databricks notebooks. In our case, if we define a Python UDF in the root notebook, it works fine on both a single-user cluster and a shared cluster. If we put the Python UDF in a child notebook that is included with the %run magic command, the executor nodes can't resolve the UDF.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 15 Jul 2024 21:50:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-udfs-spark-connect-included-modules-compatibility-issues/m-p/78874#M35629</guid>
      <dc:creator>thackman</dc:creator>
      <dc:date>2024-07-15T21:50:36Z</dc:date>
    </item>
  </channel>
</rss>

