<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: delta live table udf not known when defined in python module in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delta-live-table-udf-not-known-when-defined-in-python-module/m-p/53462#M29802</link>
    <description>&lt;P&gt;Hi David, I am having the same issue for "&lt;SPAN&gt;ModuleNotFoundError: No module named ..." when using applyInPandas. Did you ever resolve this?&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 22 Nov 2023 10:54:07 GMT</pubDate>
    <dc:creator>JamesDallimore</dc:creator>
    <dc:date>2023-11-22T10:54:07Z</dc:date>
    <item>
      <title>delta live table udf not known when defined in python module</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-table-udf-not-known-when-defined-in-python-module/m-p/37001#M26237</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;My problem is that a Python module is not found when one of its functions is used as a user-defined function. The exact error message is posted below. My repo structure is as follows:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;analytics_pipelines
│   ├── __init__.py
│   ├── coordinate_transformation.py
│   ├── data_quality_checks.py
│   ├── pipeline.py
│   └── transformations.py
├── delta_live_tables
│   ├── configurations
│   │   └── data_ingestion.json
│   └── data_ingestion.py
├── dist
├── notebooks
│   ├── local_example.ipynb
│   ├── testdata
│   │   ├── configuration.csv
│   │   ├── input.avro
│   │   └── test.parquet
├── poetry.lock
├── pyproject.toml
├── README.md
├── tests
│   ├── __init__.py
│   └── test_transformations.py&lt;/LI-CODE&gt;&lt;P&gt;In the delta_live_tables folder I have a notebook that does something like this:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;import sys
sys.path.append('/Workspace/Repos/&amp;lt;user&amp;gt;/analytics-data-pipelines/analytics_pipelines')

import pipeline

config = pipeline.setup_config(mode, avro_raw_data)
pipeline.define_ingestion_pipeline(spark, config)
pipeline.define_summary_tables(spark, config)&lt;/LI-CODE&gt;&lt;P&gt;In pipeline.define_ingestion_pipeline I define a number of Delta Live Tables via the Python API. pipeline.py also imports transformations.py and coordinate_transformation.py, which define the necessary data transformations.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;import dlt

from transformations import *
from coordinate_transformation import apply_coordinate_transform

#.....

def define_ingestion_pipeline(spark, config):
    ....
    @dlt.table(
        comment='',
        path = ...
    )
    def table_name():
        data = dlt.read_stream("other")
        return transform_data(data)
    ...&lt;/LI-CODE&gt;&lt;P&gt;Everything works except where I use a Python user-defined function in one of the transformations. The corresponding transformation looks similar to this:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;def coordinate_transform(group_keys, pdf) -&amp;gt; pd.DataFrame:

    trafo = get_coordinate_transformation(group_keys[0])
    ... do some pandas code here
    return pdf

def apply_coordinate_transform(data):
    ...
    schema = data.schema
    data = data.groupBy('serialnumber',...)\
               .applyInPandas(coordinate_transform, schema=schema)

    return data&lt;/LI-CODE&gt;&lt;P&gt;Apparently coordinate_transformation.py is not available, but why?&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Error Message:&lt;/STRONG&gt;&amp;nbsp;&lt;SPAN&gt; File "/databricks/spark/python/pyspark/serializers.py", line 188, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 540, in loads return cloudpickle.loads(obj, encoding=encoding) ModuleNotFoundError: No module named 'coordinate_transformation'&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;(Unrelated: could somebody point me to a way to multiply a 3x3 matrix onto a large Nx3 dataframe, resulting in an Nx3 dataframe?)&lt;/P&gt;&lt;P&gt;Regards and thanks&lt;/P&gt;&lt;P&gt;David&lt;/P&gt;</description>
      <pubDate>Wed, 05 Jul 2023 14:50:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-table-udf-not-known-when-defined-in-python-module/m-p/37001#M26237</guid>
      <dc:creator>david3</dc:creator>
      <dc:date>2023-07-05T14:50:48Z</dc:date>
    </item>
    <item>
      <title>Re: delta live table udf not known when defined in python module</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-table-udf-not-known-when-defined-in-python-module/m-p/53462#M29802</link>
      <description>&lt;P&gt;Hi David, I am having the same issue for "&lt;SPAN&gt;ModuleNotFoundError: No module named ..." when using applyInPandas. Did you ever resolve this?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 22 Nov 2023 10:54:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-table-udf-not-known-when-defined-in-python-module/m-p/53462#M29802</guid>
      <dc:creator>JamesDallimore</dc:creator>
      <dc:date>2023-11-22T10:54:07Z</dc:date>
    </item>
    <item>
      <title>Re: delta live table udf not known when defined in python module</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-table-udf-not-known-when-defined-in-python-module/m-p/54618#M30132</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Hello David. Same as James: I am having the same "ModuleNotFoundError: No module named ..." issue when using applyInPandas. Did you ever resolve this?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 04 Dec 2023 21:51:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-table-udf-not-known-when-defined-in-python-module/m-p/54618#M30132</guid>
      <dc:creator>Carlose</dc:creator>
      <dc:date>2023-12-04T21:51:55Z</dc:date>
    </item>
    <item>
      <title>Re: delta live table udf not known when defined in python module</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-table-udf-not-known-when-defined-in-python-module/m-p/54643#M30135</link>
      <description>&lt;P&gt;I solved this by defining the function passed to applyInPandas inside the function that uses it.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;
def apply_coordinate_transform(data):
    def coordinate_transform(group_keys, pdf) -&amp;gt; pd.DataFrame:
    
        trafo = get_coordinate_transformation(group_keys[0])
        ... do some pandas code here
        return pdf
    ...
    schema = data.schema
    data = data.groupBy('serialnumber',...)\
               .applyInPandas(coordinate_transform, schema=schema)

    return data&lt;/LI-CODE&gt;</description>
      <pubDate>Tue, 05 Dec 2023 09:24:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-table-udf-not-known-when-defined-in-python-module/m-p/54643#M30135</guid>
      <dc:creator>JamesDallimore</dc:creator>
      <dc:date>2023-12-05T09:24:17Z</dc:date>
    </item>
    <item>
      <title>Re: delta live table udf not known when defined in python module</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-table-udf-not-known-when-defined-in-python-module/m-p/54645#M30136</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Yes, I found three working options:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Define the pandas function as a nested (inline) function, as pointed out above.&lt;/LI&gt;&lt;LI&gt;Define the pandas function in the same script that is imported as a "library" in the DLT config:&lt;BR /&gt;&lt;BR /&gt;&lt;PRE&gt;libraries:
  - notebook:
      path: ./pipelines/your_dlt_declaration_containing_high_level_udfs.py&lt;/PRE&gt;&lt;/LI&gt;&lt;LI&gt;Install your Python library as a wheel (.whl) on the cluster.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;David&lt;/P&gt;</description>
      <pubDate>Tue, 05 Dec 2023 10:52:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-table-udf-not-known-when-defined-in-python-module/m-p/54645#M30136</guid>
      <dc:creator>david3</dc:creator>
      <dc:date>2023-12-05T10:52:04Z</dc:date>
    </item>
  </channel>
</rss>

