<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Trying to use Broadcast to run Presidio distributed in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/trying-to-use-broadcast-to-run-presidio-distrubuted/m-p/111881#M44027</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am currently evaluating Microsoft's &lt;A href="https://microsoft.github.io/presidio/" target="_blank" rel="noopener"&gt;Presidio&lt;/A&gt;&amp;nbsp;de-identification libraries for my organization and would like to see if we can take advantage of Spark's broadcast capabilities, but I am getting an error message:&lt;/P&gt;&lt;P&gt;"[BROADCAST_VARIABLE_NOT_LOADED] Broadcast variable `2872` not loaded."&lt;/P&gt;&lt;P&gt;According to the Databricks cluster KB, using broadcast on a shared cluster is not possible (&lt;A href="https://kb.databricks.com/clusters/broadcast_variable_not_loaded-or-jvm_attribute_not_supported-errors-when-using-broadcast-variables-in-a-shared-access-mode-cluster" target="_self"&gt;Cluster KB&lt;/A&gt;) and the solution is to "Use a single-user cluster or pass a variable into a function as a state instead". My organization does not currently allow single-user clusters, and I am frankly confused by what "pass a variable into a function as a state instead" means and how to do that in Databricks Spark.&lt;/P&gt;&lt;P&gt;If someone could provide some guidance on passing a variable into a function as state, that would be greatly appreciated.&lt;/P&gt;&lt;P&gt;Below is the code that I am trying to run that gives me the error.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StringType
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine, OperatorConfig

anonymized_column = "note_text" # name of column to anonymize
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# broadcast the engines to the cluster nodes
broadcasted_analyzer = sc.broadcast(analyzer)
broadcasted_anonymizer = sc.broadcast(anonymizer)
spark.conf.set("spark.databricks.broadcastTimeout", "600s")

# define a text-anonymizing function and a pandas Series function over it
def anonymize_text(text: str) -&amp;gt; str:
    analyzer = broadcasted_analyzer.value
    anonymizer = broadcasted_anonymizer.value
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={
            "DEFAULT": OperatorConfig("replace", {"new_value": "&amp;lt;ANONYMIZED&amp;gt;"})
        },
    )
    return anonymized_results.text

def anonymize_series(s: pd.Series) -&amp;gt; pd.Series:
    return s.apply(anonymize_text)

# define the function as a pandas UDF
anonymize = pandas_udf(anonymize_series, returnType=StringType())

# convert Pandas DataFrame to Spark DataFrame
spark_report_df = spark.createDataFrame(report_df)

# apply the udf
anonymized_df = spark_report_df.withColumn(
    anonymized_column, anonymize(col(anonymized_column))
)
display(anonymized_df)&lt;/LI-CODE&gt;</description>
    <pubDate>Wed, 05 Mar 2025 23:53:07 GMT</pubDate>
    <dc:creator>kertsman_nm</dc:creator>
    <dc:date>2025-03-05T23:53:07Z</dc:date>
    <item>
      <title>Trying to use Broadcast to run Presidio distributed</title>
      <link>https://community.databricks.com/t5/data-engineering/trying-to-use-broadcast-to-run-presidio-distrubuted/m-p/111881#M44027</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am currently evaluating Microsoft's &lt;A href="https://microsoft.github.io/presidio/" target="_blank" rel="noopener"&gt;Presidio&lt;/A&gt;&amp;nbsp;de-identification libraries for my organization and would like to see if we can take advantage of Spark's broadcast capabilities, but I am getting an error message:&lt;/P&gt;&lt;P&gt;"[BROADCAST_VARIABLE_NOT_LOADED] Broadcast variable `2872` not loaded."&lt;/P&gt;&lt;P&gt;According to the Databricks cluster KB, using broadcast on a shared cluster is not possible (&lt;A href="https://kb.databricks.com/clusters/broadcast_variable_not_loaded-or-jvm_attribute_not_supported-errors-when-using-broadcast-variables-in-a-shared-access-mode-cluster" target="_self"&gt;Cluster KB&lt;/A&gt;) and the solution is to "Use a single-user cluster or pass a variable into a function as a state instead". My organization does not currently allow single-user clusters, and I am frankly confused by what "pass a variable into a function as a state instead" means and how to do that in Databricks Spark.&lt;/P&gt;&lt;P&gt;If someone could provide some guidance on passing a variable into a function as state, that would be greatly appreciated.&lt;/P&gt;&lt;P&gt;Below is the code that I am trying to run that gives me the error.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StringType
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine, OperatorConfig

anonymized_column = "note_text" # name of column to anonymize
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# broadcast the engines to the cluster nodes
broadcasted_analyzer = sc.broadcast(analyzer)
broadcasted_anonymizer = sc.broadcast(anonymizer)
spark.conf.set("spark.databricks.broadcastTimeout", "600s")

# define a text-anonymizing function and a pandas Series function over it
def anonymize_text(text: str) -&amp;gt; str:
    analyzer = broadcasted_analyzer.value
    anonymizer = broadcasted_anonymizer.value
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={
            "DEFAULT": OperatorConfig("replace", {"new_value": "&amp;lt;ANONYMIZED&amp;gt;"})
        },
    )
    return anonymized_results.text

def anonymize_series(s: pd.Series) -&amp;gt; pd.Series:
    return s.apply(anonymize_text)

# define the function as a pandas UDF
anonymize = pandas_udf(anonymize_series, returnType=StringType())

# convert Pandas DataFrame to Spark DataFrame
spark_report_df = spark.createDataFrame(report_df)

# apply the udf
anonymized_df = spark_report_df.withColumn(
    anonymized_column, anonymize(col(anonymized_column))
)
display(anonymized_df)&lt;/LI-CODE&gt;</description>
      <pubDate>Wed, 05 Mar 2025 23:53:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trying-to-use-broadcast-to-run-presidio-distrubuted/m-p/111881#M44027</guid>
      <dc:creator>kertsman_nm</dc:creator>
      <dc:date>2025-03-05T23:53:07Z</dc:date>
    </item>
    <item>
      <title>Re: Trying to use Broadcast to run Presidio distributed</title>
      <link>https://community.databricks.com/t5/data-engineering/trying-to-use-broadcast-to-run-presidio-distrubuted/m-p/138288#M50900</link>
      <description>&lt;P&gt;You’re encountering the &lt;CODE&gt;[BROADCAST_VARIABLE_NOT_LOADED]&lt;/CODE&gt; error because clusters in shared access mode cannot use broadcast variables with non-serializable Python objects (such as your Presidio engines) due to cluster architecture limitations. The cluster KB is correct: you must use a different approach, namely “pass a variable into a function as state instead.”&lt;/P&gt;
&lt;H2&gt;Solutions for Passing Variables as State&lt;/H2&gt;
&lt;P&gt;In PySpark, instead of using broadcast variables (which require serialization), you should:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Instantiate your analyzer and anonymizer inside the UDF function.&lt;/STRONG&gt;&lt;/P&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;P&gt;This ensures each worker creates its own local instance when the UDF executes.&lt;/P&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;P&gt;This pattern is commonly called “lazy initialization” and avoids broadcast altogether.&lt;/P&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Most notably, &lt;STRONG&gt;do not instantiate your Presidio engines outside the UDF or attempt to broadcast them.&lt;/STRONG&gt; Initialize them within the UDF’s scope instead.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2&gt;Example Rewrite for Databricks Shared Cluster&lt;/H2&gt;
&lt;P&gt;Here’s how you could change your code:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StringType

anonymized_column = "note_text"  # name of column to anonymize

def anonymize_text(text: str) -&amp;gt; str:
    # Initialize inside the function so each worker gets its own instance
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine, OperatorConfig

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={
            "DEFAULT": OperatorConfig("replace", {"new_value": "&amp;lt;ANONYMIZED&amp;gt;"})
        },
    )
    return anonymized_results.text

def anonymize_series(s: pd.Series) -&amp;gt; pd.Series:
    return s.apply(anonymize_text)

# define the function as a pandas UDF
anonymize = pandas_udf(anonymize_series, returnType=StringType())

# convert pandas DataFrame to Spark DataFrame
spark_report_df = spark.createDataFrame(report_df)

# apply the UDF
anonymized_df = spark_report_df.withColumn(
    anonymized_column, anonymize(col(anonymized_column))
)
display(anonymized_df)&lt;/LI-CODE&gt;
&lt;H2&gt;Key Points&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;P&gt;The analyzer and anonymizer are &lt;STRONG&gt;not broadcast&lt;/STRONG&gt; or shared; each worker builds its own instances when the UDF runs.&lt;/P&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;P&gt;This avoids serialization issues and works within the Databricks shared-mode restrictions.&lt;/P&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;P&gt;The performance trade-off is repeated initialization on the workers (in the rewrite above, the engines are constructed on every call, so consider caching them), but some form of worker-side initialization is necessary in this environment.&lt;/P&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;H2&gt;Additional Recommendations&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;P&gt;If initialization time for the engines is significant and you have control over your cluster policies, you could investigate using a singleton pattern within the workers (build each engine once per worker process and reuse it). Otherwise, instantiation inside the UDF is safest for compatibility.&lt;/P&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;P&gt;Refer to Presidio’s documentation for any specific serialization or concurrency guidance for its engines.&lt;/P&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;H2&gt;Cluster KB Reference&lt;/H2&gt;
&lt;P&gt;Here’s the relevant advice from the Databricks KB:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;“Use a single-user cluster or pass a variable into a function as a state instead.” This means initializing the variable within the function or UDF, not as a broadcast variable.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;HR /&gt;
&lt;H2&gt;Summary Table&lt;/H2&gt;
&lt;TABLE&gt;
&lt;THEAD&gt;
&lt;TR&gt;&lt;TH&gt;Approach&lt;/TH&gt;&lt;TH&gt;Supported in Shared Mode&lt;/TH&gt;&lt;TH&gt;Usage&lt;/TH&gt;&lt;TH&gt;Notes&lt;/TH&gt;&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;&lt;TD&gt;Broadcast Variables&lt;/TD&gt;&lt;TD&gt;&lt;span class="lia-unicode-emoji" title=":cross_mark:"&gt;❌&lt;/span&gt; No&lt;/TD&gt;&lt;TD&gt;sc.broadcast(obj)&lt;/TD&gt;&lt;TD&gt;Fails with non-serializable objects&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;State Variable in Function/UDF&lt;/TD&gt;&lt;TD&gt;&lt;span class="lia-unicode-emoji" title=":white_heavy_check_mark:"&gt;✅&lt;/span&gt; Yes&lt;/TD&gt;&lt;TD&gt;init inside function/UDF&lt;/TD&gt;&lt;TD&gt;Best for Python objects&lt;/TD&gt;&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;HR /&gt;
&lt;P&gt;Switching to UDF-level initialization is the standard solution for using Presidio on Databricks shared clusters. If you need more efficient instantiation, consider reaching out via the Presidio GitHub discussions or support channels for optimization tips.&lt;/P&gt;</description>
      <pubDate>Sun, 09 Nov 2025 14:31:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trying-to-use-broadcast-to-run-presidio-distrubuted/m-p/138288#M50900</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-11-09T14:31:40Z</dc:date>
    </item>
  </channel>
</rss>

