<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Executing Python code inside a SQL Function in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/executing-python-code-inside-a-sql-function/m-p/98328#M39694</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/95713"&gt;@Dp15&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The error stack trace doesn't point to a problem with running Spark SQL inside the SQL UDF itself, but to the attempt to create a SparkSession from inside the UDF, which is not permitted. The UDF runs in a separate execution context that has no direct access to the main Spark context or session, which is why you see the SystemExit error.&lt;/P&gt;
&lt;P&gt;The Spark session should be created outside the UDF, and the queries executed there before the results are passed to the UDF for further processing.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Is there a way to run spark sql inside the SQL UDF?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;So, you can run code inside your UDF, but you can't initialize the SparkSession there or reuse one.&lt;/P&gt;</description>
    <pubDate>Mon, 11 Nov 2024 12:32:19 GMT</pubDate>
    <dc:creator>VZLA</dc:creator>
    <dc:date>2024-11-11T12:32:19Z</dc:date>
    <item>
      <title>Executing Python code inside a SQL Function</title>
      <link>https://community.databricks.com/t5/data-engineering/executing-python-code-inside-a-sql-function/m-p/98214#M39651</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am trying to create a SQL UDF that runs some Python code involving PySpark, but I am not able to create a Spark session inside the Python section of the function. Here is how my code looks:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;	CREATE OR REPLACE FUNCTION test.getValuesFromTable(field1 INT,field2 INT)
	RETURNS Map&amp;lt;STRING,ARRAY&amp;lt;STRING&amp;gt;&amp;gt;
	  LANGUAGE PYTHON
	  AS $$
		from pyspark.sql.functions import col
		import numpy as np
		import pyspark
		from pyspark.sql import SparkSession
		spark = SparkSession.builder.appName("sample").getOrCreate()
		def getqueryList():
			query1 = "select distinct(value1) from test.table"
			query2 ="select distinct(value2) from test.table"
			

			query_list = [query1,query2]

			return query_list



		def executeQuery(field1,field2,query):
			query = query + " where field1 = {} and field2 = {}".format(field1,field2)
			return spark.sql(query)

		def getValues(field1,field2):
			result_dict = {}
			result_list = []
			df_list = [executeQuery(field1,field2,query) for query in getqueryList()]
			for df in df_list:
				fieldName = df.schema.names[0]    
				result_list = [row[0] for row in df.select(fieldName).collect()]
				result_dict[fieldName] = np.array(result_list)
			return result_dict
		
		return getValues(field1,field2)
		$$&lt;/LI-CODE&gt;&lt;P&gt;When I try to execute the function, I am not able to invoke a Spark session:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;SystemExit: -1
== Stacktrace ==
  File "&amp;lt;udfbody&amp;gt;", line 6, in main
    spark = SparkSession.builder.appName("sample").getOrCreate()
  File "/databricks/spark/python/pyspark/sql/session.py", line 562, in getOrCreate
    else SparkContext.getOrCreate(sparkConf)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/core/context.py", line 574, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/databricks/spark/python/pyspark/core/context.py", line 206, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/databricks/spark/python/pyspark/core/context.py", line 495, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
                                       ^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/java_gateway.py", line 63, in launch_gateway
    SPARK_HOME = _find_spark_home()&lt;/LI-CODE&gt;&lt;P&gt;&lt;BR /&gt;&lt;BR /&gt;Is there a way to run spark sql inside the SQL UDF?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 08 Nov 2024 20:34:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/executing-python-code-inside-a-sql-function/m-p/98214#M39651</guid>
      <dc:creator>Dp15</dc:creator>
      <dc:date>2024-11-08T20:34:24Z</dc:date>
    </item>
    <item>
      <title>Re: Executing Python code inside a SQL Function</title>
      <link>https://community.databricks.com/t5/data-engineering/executing-python-code-inside-a-sql-function/m-p/98328#M39694</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/95713"&gt;@Dp15&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The error stack trace doesn't point to a problem with running Spark SQL inside the SQL UDF itself, but to the attempt to create a SparkSession from inside the UDF, which is not permitted. The UDF runs in a separate execution context that has no direct access to the main Spark context or session, which is why you see the SystemExit error.&lt;/P&gt;
&lt;P&gt;The Spark session should be created outside the UDF, and the queries executed there before the results are passed to the UDF for further processing.&lt;/P&gt;
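That restructuring can be sketched in plain Python (a minimal sketch, not the poster's actual code: the Spark calls stay on the driver and are shown only as comments, and the table/column names are taken from the thread as assumptions):

```python
# Sketch: collect query results on the driver first, then pass plain Python
# data into the function -- no SparkSession is created or used inside it.

def get_values(collected):
    """collected maps a column name to the rows fetched for that column."""
    result = {}
    for field_name, rows in collected.items():
        # keep distinct values, preserving first-seen order
        result[field_name] = list(dict.fromkeys(rows))
    return result

# Driver-side usage (needs a live cluster, so shown as comments):
#   rows = [r[0] for r in spark.sql(
#       "select distinct value1 from test.table "
#       "where field1 = 10 and field2 = 20").collect()]
#   get_values({"value1": rows})

print(get_values({"value1": ["a", "b", "a"]}))  # prints {'value1': ['a', 'b']}
```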
&lt;P&gt;&lt;STRONG&gt;Is there a way to run spark sql inside the SQL UDF?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;So, you can run code inside your UDF, but you can't initialize the SparkSession there or reuse one.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Nov 2024 12:32:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/executing-python-code-inside-a-sql-function/m-p/98328#M39694</guid>
      <dc:creator>VZLA</dc:creator>
      <dc:date>2024-11-11T12:32:19Z</dc:date>
    </item>
    <item>
      <title>Re: Executing Python code inside a SQL Function</title>
      <link>https://community.databricks.com/t5/data-engineering/executing-python-code-inside-a-sql-function/m-p/98343#M39701</link>
      <description>&lt;P&gt;So something like this should work?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sample").getOrCreate()
getValues(spark,scenario_id,reporting_date)
       &lt;/LI-CODE&gt;&lt;P&gt;I tried this, but I am not able to pass the SparkSession object to the method definition:&lt;BR /&gt;&lt;SPAN&gt;Unsupported data type "SPARKSESSION"&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 11 Nov 2024 14:34:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/executing-python-code-inside-a-sql-function/m-p/98343#M39701</guid>
      <dc:creator>Dp15</dc:creator>
      <dc:date>2024-11-11T14:34:47Z</dc:date>
    </item>
    <item>
      <title>Re: Executing Python code inside a SQL Function</title>
      <link>https://community.databricks.com/t5/data-engineering/executing-python-code-inside-a-sql-function/m-p/98348#M39702</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/95713"&gt;@Dp15&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My apologies for the confusion; I've edited my previous comment (&lt;U&gt;and the wrong statement[1]&lt;/U&gt;) to make it clear:&lt;/P&gt;
&lt;DIV id="bodyDisplay_0" class="lia-message-body lia-component-message-view-widget-body lia-component-body-signature-highlight-escalation lia-component-message-view-widget-body-signature-highlight-escalation"&gt;
&lt;DIV class="lia-message-body-content"&gt;
&lt;P&gt;&lt;U&gt;&lt;EM&gt;[1] So, you can run &lt;STRONG&gt;code&lt;/STRONG&gt; inside your udf, but you can't initialize the sparksession there &lt;STRONG&gt;or reuse one&lt;/STRONG&gt;.&lt;/EM&gt;&lt;/U&gt;&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;Executing "spark.sql" statements within a PySparkUDF is not feasible as we can't create a SparkSession object inside the Executors or use one in Executors as it is part of the Driver context only.&amp;nbsp;UDFs are typically used to apply custom transformations to individual DataFrame rows or columns, but retrieving distinct values from specific columns based on certain conditions can be done using PySparks's build int DF operations.&lt;/P&gt;
&lt;P&gt;Will this work for your use case instead?&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder.appName("sample").getOrCreate()

def get_values_from_table(field1_value, field2_value):
    df = spark.table("test.table")
    
    filtered_df = df.filter((col("field1") == field1_value) &amp;amp; (col("field2") == field2_value))
    
    result_dict = {}
    
    columns_to_query = ["value1", "value2"]
    
    for column in columns_to_query:
        # avoids .rdd, which is unavailable on shared-access and serverless clusters
        distinct_values = [row[0] for row in filtered_df.select(column).distinct().collect()]
        result_dict[column] = distinct_values
    
    return result_dict

# Example usage
field1_value = 10
field2_value = 20
result = get_values_from_table(field1_value, field2_value)
print(result)
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 11 Nov 2024 15:25:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/executing-python-code-inside-a-sql-function/m-p/98348#M39702</guid>
      <dc:creator>VZLA</dc:creator>
      <dc:date>2024-11-11T15:25:59Z</dc:date>
    </item>
    <item>
      <title>Re: Executing Python code inside a SQL Function</title>
      <link>https://community.databricks.com/t5/data-engineering/executing-python-code-inside-a-sql-function/m-p/98350#M39704</link>
      <description>&lt;P&gt;Actually this would work if I were using it in a native notebook environment; however, I am trying to create a UDF because I want these queries to be executed from an external JDBC connection, and I don't want to wait for a cluster to spin up for a notebook. Is there a way I can achieve this?&lt;/P&gt;</description>
      <pubDate>Mon, 11 Nov 2024 16:04:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/executing-python-code-inside-a-sql-function/m-p/98350#M39704</guid>
      <dc:creator>Dp15</dc:creator>
      <dc:date>2024-11-11T16:04:28Z</dc:date>
    </item>
  </channel>
</rss>

