<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Issue with UDF's and DLT where UDF is multi layered and externalized in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/issue-with-udf-s-and-dlt-where-udf-is-multi-layered-and/m-p/111507#M43916</link>
    <description>&lt;DIV&gt;I'm having an issue getting UDFs to work within a DLT pipeline when the UDF is defined outside of the notebook and calls other functions.&amp;nbsp; The end goal is to put unit test coverage around the various functions, hence the pattern.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;For testing purposes, I created a couple of simple UDFs in a file, udf.py.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from pyspark.sql.functions import udf&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from pyspark.sql.types import StringType&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;# First UDF that reverses a string&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;def reverse_string(x):&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;&amp;nbsp; &amp;nbsp; return x[::-1]&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;reverse_udf = udf(reverse_string, StringType())&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;# Second UDF that calls the first UDF's underlying function and then converts the result to uppercase&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;def upper_and_reverse_string(x):&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;&amp;nbsp; &amp;nbsp; reversed_string = reverse_string(x)&amp;nbsp; # Call the plain Python function behind the first UDF&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;&amp;nbsp; &amp;nbsp; return reversed_string.upper()&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;upper_reverse_udf = udf(upper_and_reverse_string, StringType())&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;When I attempt to use upper_reverse_udf, I keep getting a broken pipe (UDF_PYSPARK_ERROR.UNKNOWN) error.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;import dlt&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from pyspark.sql.functions import col&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from udf import *&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;@dlt.table&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;def test_transformed_table():&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;&amp;nbsp; &amp;nbsp; df = dlt.read("test_table").select("system")&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;&amp;nbsp; &amp;nbsp; return df.withColumn("processed_systems", upper_reverse_udf(col("system")))&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The code works fine outside of DLT; the issue only occurs within a DLT pipeline.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from pyspark.sql.functions import col&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from udf import *&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;df = spark.read.table("test_table").select("system")&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;df = df.withColumn("processed_systems", upper_reverse_udf(col("system")))&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;display(df)&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I'm not seeing anything useful in the logs.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The runtime version for the pipeline is "dlt:15.4.9-delta-pipelines-dlt-release-dp-2025.07-rc0-commit-7005556-image-9d7698a".&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;If I embed all the code into a single function it seems to work, but I lose the ability to reuse functions across different UDFs.&lt;/DIV&gt;</description>
    <pubDate>Sat, 01 Mar 2025 18:36:38 GMT</pubDate>
    <dc:creator>drollason</dc:creator>
    <dc:date>2025-03-01T18:36:38Z</dc:date>
    <item>
      <title>Issue with UDF's and DLT where UDF is multi layered and externalized</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-with-udf-s-and-dlt-where-udf-is-multi-layered-and/m-p/111507#M43916</link>
      <description>&lt;DIV&gt;I'm having an issue getting UDFs to work within a DLT pipeline when the UDF is defined outside of the notebook and calls other functions.&amp;nbsp; The end goal is to put unit test coverage around the various functions, hence the pattern.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;For testing purposes, I created a couple of simple UDFs in a file, udf.py.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from pyspark.sql.functions import udf&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from pyspark.sql.types import StringType&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;# First UDF that reverses a string&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;def reverse_string(x):&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;&amp;nbsp; &amp;nbsp; return x[::-1]&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;reverse_udf = udf(reverse_string, StringType())&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;# Second UDF that calls the first UDF's underlying function and then converts the result to uppercase&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;def upper_and_reverse_string(x):&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;&amp;nbsp; &amp;nbsp; reversed_string = reverse_string(x)&amp;nbsp; # Call the plain Python function behind the first UDF&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;&amp;nbsp; &amp;nbsp; return reversed_string.upper()&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;upper_reverse_udf = udf(upper_and_reverse_string, StringType())&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;When I attempt to use upper_reverse_udf, I keep getting a broken pipe (UDF_PYSPARK_ERROR.UNKNOWN) error.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;import dlt&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from pyspark.sql.functions import col&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from udf import *&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;@dlt.table&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;def test_transformed_table():&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;&amp;nbsp; &amp;nbsp; df = dlt.read("test_table").select("system")&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;&amp;nbsp; &amp;nbsp; return df.withColumn("processed_systems", upper_reverse_udf(col("system")))&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The code works fine outside of DLT; the issue only occurs within a DLT pipeline.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from pyspark.sql.functions import col&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;from udf import *&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;df = spark.read.table("test_table").select("system")&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;df = df.withColumn("processed_systems", upper_reverse_udf(col("system")))&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;EM&gt;display(df)&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I'm not seeing anything useful in the logs.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The runtime version for the pipeline is "dlt:15.4.9-delta-pipelines-dlt-release-dp-2025.07-rc0-commit-7005556-image-9d7698a".&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;If I embed all the code into a single function it seems to work, but I lose the ability to reuse functions across different UDFs.&lt;/DIV&gt;</description>
      <pubDate>Sat, 01 Mar 2025 18:36:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-with-udf-s-and-dlt-where-udf-is-multi-layered-and/m-p/111507#M43916</guid>
      <dc:creator>drollason</dc:creator>
      <dc:date>2025-03-01T18:36:38Z</dc:date>
    </item>
    <item>
      <title>Re: Issue with UDF's and DLT where UDF is multi layered and externalized</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-with-udf-s-and-dlt-where-udf-is-multi-layered-and/m-p/111936#M44049</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/143020"&gt;@drollason&lt;/a&gt;.&amp;nbsp;In DLT pipelines, I would try packaging your code as a wheel and then installing it via pip. I had the same scenario as you and was able to bring in my custom code this way.&lt;/P&gt;</description>
      <pubDate>Thu, 06 Mar 2025 19:03:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-with-udf-s-and-dlt-where-udf-is-multi-layered-and/m-p/111936#M44049</guid>
      <dc:creator>bgiesbrecht</dc:creator>
      <dc:date>2025-03-06T19:03:44Z</dc:date>
    </item>
  </channel>
</rss>

