<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Python DataSource API utilities/ Import Fails in Spark Declarative Pipeline in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/python-datasource-api-utilities-import-fails-in-spark/m-p/144281#M52307</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176516"&gt;@emma_s&lt;/a&gt;&amp;nbsp;&lt;SPAN&gt;Thank you for the guidance! The wheel package approach worked perfectly.&lt;/SPAN&gt;&lt;BR /&gt;I also tried placing the&amp;nbsp;&lt;STRONG&gt;.py&amp;nbsp;&lt;/STRONG&gt;file directly in the path below, but that did not work:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;/Workspace/Libraries/custom_datasource.py&lt;/LI-CODE&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="smpa01_0-1768602190081.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/23041iE351439CC0553234/image-size/medium?v=v2&amp;amp;px=400" role="button" title="smpa01_0-1768602190081.png" alt="smpa01_0-1768602190081.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 16 Jan 2026 22:23:58 GMT</pubDate>
    <dc:creator>smpa01</dc:creator>
    <dc:date>2026-01-16T22:23:58Z</dc:date>
    <item>
      <title>Python DataSource API utilities/ Import Fails in Spark Declarative Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/python-datasource-api-utilities-import-fails-in-spark/m-p/144203#M52283</link>
      <description>&lt;DIV&gt;&lt;STRONG&gt;TLDR -&amp;nbsp;&lt;/STRONG&gt;UDFs work fine when imported from the `utilities/` folder in DLT pipelines, but custom Python DataSource classes fail with `ModuleNotFoundError: No module named 'utilities'` during serialization. Only inline definitions work. I need reusable DataSource classes across multiple transformations.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;Context&lt;/STRONG&gt; -&amp;nbsp;The Databricks auto-generated SDP project works perfectly with the utilities folder pattern:&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Working Structure:&lt;/DIV&gt;&lt;DIV&gt;```&lt;/DIV&gt;&lt;DIV&gt;SDP_Project/&lt;/DIV&gt;&lt;DIV&gt;├── transformations/&lt;/DIV&gt;&lt;DIV&gt;│&amp;nbsp; &amp;nbsp;└── pipeline.py&lt;/DIV&gt;&lt;DIV&gt;└── utilities/&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; └── utils.py&lt;/DIV&gt;&lt;DIV&gt;```&lt;/DIV&gt;&lt;DIV&gt;Working Code:&lt;/DIV&gt;&lt;DIV&gt;```python&lt;/DIV&gt;&lt;DIV&gt;# transformations/pipeline.py&lt;/DIV&gt;&lt;DIV&gt;from pyspark import pipelines as dp&lt;/DIV&gt;&lt;DIV&gt;from pyspark.sql.functions import col&lt;/DIV&gt;&lt;DIV&gt;from utilities import utils&lt;/DIV&gt;&lt;DIV&gt;@dp.table&lt;/DIV&gt;&lt;DIV&gt;def my_table():&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; return df.withColumn("valid", utils.is_valid_email(col("email")))&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;# utilities/utils.py&lt;/DIV&gt;&lt;DIV&gt;import re&lt;/DIV&gt;&lt;DIV&gt;from pyspark.sql.functions import udf&lt;/DIV&gt;&lt;DIV&gt;from pyspark.sql.types import BooleanType&lt;/DIV&gt;&lt;DIV&gt;pattern = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"&amp;nbsp; # example pattern&lt;/DIV&gt;&lt;DIV&gt;@udf(returnType=BooleanType())&lt;/DIV&gt;&lt;DIV&gt;def is_valid_email(email):&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; return re.match(pattern, email) is not None&lt;/DIV&gt;&lt;DIV&gt;```&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;Problem -&amp;nbsp;&lt;/STRONG&gt;The same pattern fails with the Python DataSource API.&lt;/DIV&gt;&lt;DIV&gt;My Structure:&lt;/DIV&gt;&lt;DIV&gt;```&lt;/DIV&gt;&lt;DIV&gt;Test_DLT/&lt;/DIV&gt;&lt;DIV&gt;├── transformations/&lt;/DIV&gt;&lt;DIV&gt;│&amp;nbsp; &amp;nbsp;└── dlt.py&lt;/DIV&gt;&lt;DIV&gt;└── utilities/&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; └── utils.py&lt;/DIV&gt;&lt;DIV&gt;```&lt;/DIV&gt;&lt;DIV&gt;My Code:&lt;/DIV&gt;&lt;DIV&gt;```python&lt;/DIV&gt;&lt;DIV&gt;# utilities/utils.py&lt;/DIV&gt;&lt;DIV&gt;from pyspark.sql import Row&lt;/DIV&gt;&lt;DIV&gt;from pyspark.sql.datasource import DataSource, DataSourceReader&lt;/DIV&gt;&lt;DIV&gt;class SomeCustomReader(DataSourceReader):&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; def read(self, partition):&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; # custom logic here&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; yield Row(**data)&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;class SomeCustomSource(DataSource):&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; @classmethod&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; def name(cls):&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; return "some_source"&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; def reader(self, schema):&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; return SomeCustomReader(self.options)&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;# transformations/dlt.py&lt;/DIV&gt;&lt;DIV&gt;from pyspark import pipelines as dp&lt;/DIV&gt;&lt;DIV&gt;from pyspark.sql import SparkSession&lt;/DIV&gt;&lt;DIV&gt;from utilities.utils import SomeCustomSource&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;spark = SparkSession.builder.getOrCreate()&lt;/DIV&gt;&lt;DIV&gt;spark.dataSource.register(SomeCustomSource)&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;@dp.table&lt;/DIV&gt;&lt;DIV&gt;def read_data():&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; df = spark.read.format("some_source").option(...).load()&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; return df&lt;/DIV&gt;&lt;DIV&gt;```&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;Error&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;```&lt;/DIV&gt;&lt;DIV&gt;ModuleNotFoundError: No module named 'utilities'&lt;/DIV&gt;&lt;DIV&gt;pyspark.serializers.SerializationError: Caused by cloudpickle.loads(obj, encoding=encoding)&lt;/DIV&gt;&lt;DIV&gt;```&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;What Works&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;Only inline class definitions work (defining the DataSource classes directly in dlt.py).&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;Questions&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;1. Am I doing anything wrong?&lt;/DIV&gt;&lt;DIV&gt;2. Is there a way to make custom DataSource classes work with the utilities folder pattern?&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;Benefit&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;I want to avoid defining the classes inline in each of n transformations - I need reusable DataSource definitions.&lt;/DIV&gt;</description>
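The ModuleNotFoundError above follows from how pickling works in Python. Spark ships a registered DataSource class to workers with cloudpickle, and anything imported from a module is serialized by reference: the payload records only the module path and attribute name, never the source code, so the receiving side must be able to import that module itself. A minimal stdlib sketch of the mechanism, using `json.loads` as a stand-in for a class imported from `utilities`:

```python
import pickle
import json

# Pickle serializes an object imported from a module *by reference*:
# the payload stores only the module path and attribute name, not the
# code. The unpickling side must then import that module itself --
# which is exactly what fails on Spark workers that cannot import the
# `utilities` package.
payload = pickle.dumps(json.loads)

# Only the names "json" and "loads" travel in the payload:
print(b"json" in payload and b"loads" in payload)  # True

# Unpickling works here only because this interpreter can import json:
print(pickle.loads(payload) is json.loads)  # True
```

This is also why the inline definitions work: cloudpickle serializes classes defined in the pipeline file itself by value, embedding their code in the payload, so no import is needed on the worker.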
      <pubDate>Fri, 16 Jan 2026 04:58:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-datasource-api-utilities-import-fails-in-spark/m-p/144203#M52283</guid>
      <dc:creator>smpa01</dc:creator>
      <dc:date>2026-01-16T04:58:54Z</dc:date>
    </item>
    <item>
      <title>Re: Python DataSource API utilities/ Import Fails in Spark Declarative Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/python-datasource-api-utilities-import-fails-in-spark/m-p/144275#M52304</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;Your findings are correct. Spark can distribute UDFs across the worker nodes by serializing them, but it cannot do the same with a class defined in a sibling folder such as utilities. Serialization only records a reference to the class, so the worker nodes must be able to import it as a Python package. If your utilities directory is not part of a packaged (installed) Python module, it is not distributed to the worker nodes of the Databricks cluster and therefore cannot be used.&lt;BR /&gt;To get around this, update the pipeline settings so the directory is treated as a Python module; this doc should help:&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/ldp/import-workspace-files" target="_blank"&gt;https://docs.databricks.com/aws/en/ldp/import-workspace-files&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;In particular, this section&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="emma_s_0-1768585750144.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/23039i2D5A114FD3D979F3/image-size/medium?v=v2&amp;amp;px=400" role="button" title="emma_s_0-1768585750144.png" alt="emma_s_0-1768585750144.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;I hope that helps.&lt;/P&gt;</description>
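One way to make `utilities` importable on every node, as the linked doc describes, is to give it package metadata and install it into the pipeline environment. A minimal packaging sketch; the name, version, and layout here are illustrative, not taken from the linked page:

```toml
# pyproject.toml at the project root, next to the utilities/ folder
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "utilities"
version = "0.1.0"

[tool.setuptools]
packages = ["utilities"]
```

Building a wheel from this (e.g. with the `build` package: `python -m build --wheel`) produces an artifact under `dist/` that can then be declared as a dependency of the pipeline environment, so workers install it rather than relying on the workspace folder layout.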
      <pubDate>Fri, 16 Jan 2026 17:49:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-datasource-api-utilities-import-fails-in-spark/m-p/144275#M52304</guid>
      <dc:creator>emma_s</dc:creator>
      <dc:date>2026-01-16T17:49:28Z</dc:date>
    </item>
    <item>
      <title>Re: Python DataSource API utilities/ Import Fails in Spark Declarative Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/python-datasource-api-utilities-import-fails-in-spark/m-p/144281#M52307</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176516"&gt;@emma_s&lt;/a&gt;&amp;nbsp;&lt;SPAN&gt;Thank you for the guidance! The wheel package approach worked perfectly.&lt;/SPAN&gt;&lt;BR /&gt;I also tried placing the&amp;nbsp;&lt;STRONG&gt;.py&amp;nbsp;&lt;/STRONG&gt;file directly in the path below, but that did not work:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;/Workspace/Libraries/custom_datasource.py&lt;/LI-CODE&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="smpa01_0-1768602190081.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/23041iE351439CC0553234/image-size/medium?v=v2&amp;amp;px=400" role="button" title="smpa01_0-1768602190081.png" alt="smpa01_0-1768602190081.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
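The wheel works where the loose .py file does not because each worker interpreter can only import what is installed into its environment or on its `sys.path`; an installed wheel is resolvable everywhere, while a file dropped into a workspace folder is not. A quick, hedged way to check importability from Python (`custom_datasource_xyz` is a deliberately nonexistent name used for illustration):

```python
import importlib.util

# find_spec resolves a module name the same way `import` would,
# returning None when the module is not importable from this
# interpreter. An installed wheel makes a package resolvable on every
# node; a loose .py in a workspace folder is not on the workers'
# sys.path, so resolution fails there.
def can_import(name):
    return importlib.util.find_spec(name) is not None

print(can_import("json"))                   # stdlib module: True
print(can_import("custom_datasource_xyz"))  # not installed anywhere: False
```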
      <pubDate>Fri, 16 Jan 2026 22:23:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-datasource-api-utilities-import-fails-in-spark/m-p/144281#M52307</guid>
      <dc:creator>smpa01</dc:creator>
      <dc:date>2026-01-16T22:23:58Z</dc:date>
    </item>
  </channel>
</rss>

