<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: sparkContext in Runtime 15.3 in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/110591#M43614</link>
    <description>&lt;P&gt;Have you found any solutions to this problem?&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/114576"&gt;@rushi29&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/122601"&gt;@jayct&lt;/a&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 19 Feb 2025 11:57:55 GMT</pubDate>
    <dc:creator>Gangster</dc:creator>
    <dc:date>2025-02-19T11:57:55Z</dc:date>
    <item>
      <title>sparkContext in Runtime 15.3</title>
      <link>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/81646#M36375</link>
      <description>&lt;DIV&gt;Hello All,&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Our Azure Databricks cluster runs under the "Legacy Shared Compute" policy with the 15.3 runtime. One of our Python notebooks connects to an Azure SQL database to read/insert data. The following snippet of code runs queries to insert/update data in the Azure SQL database and to execute stored procedures. All of this works without any issues. However, we have now upgraded our environment to Unity Catalog and want to start using Unity Catalog instead of the hive_metastore. To write data to a Unity Catalog catalog, the cluster must run under the "Shared Compute" policy rather than "Legacy Shared Compute". Unfortunately, running the cluster in this mode is a problem because, per the documentation, sparkContext is not supported for this cluster type on runtime 14.0 and above.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/compute/access-mode-limitations" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/compute/access-mode-limitations&lt;/A&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;So, the following line of code errors out, since _sc is not available:&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;driver_manager = spark_session._sc._gateway.jvm.java.sql.DriverManager&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I have looked through the documentation but haven't seen anything that replaces this code so it can run on a UC-enabled cluster. I could use a DataFrame to insert the data into Azure SQL, but that becomes tricky when I want to return something back, e.g. the newly inserted identity value from that operation.
There are also additional concerns with the DataFrame approach, such as code structure and difficulty of reuse.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I wanted to know if there is a different approach that achieves the functionality below on UC-enabled clusters running runtime 15.3 or above. Also, are there any plans to support sparkContext in future runtime versions? If so, I can simply wait for that runtime to be released; if not, I will need to find an alternative.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I appreciate any help in this matter.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;def get_sqlserver_jdbc_connection(spark_session, server_name, database_name):&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; driver_manager = spark_session._sc._gateway.jvm.java.sql.DriverManager&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; jdbc_url = "MY DB URL"&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; connection = driver_manager.getConnection(jdbc_url)&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; return connection&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;# Execute SQL Query with parameters&lt;/DIV&gt;&lt;DIV&gt;connection = get_sqlserver_jdbc_connection(spark_session = spark_session, server_name = server_name, database_name = database_name)&lt;/DIV&gt;&lt;DIV&gt;statement = connection.prepareStatement(sql)&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;statement.setInt(1, int(self.application_id))&lt;/DIV&gt;&lt;DIV&gt;statement.setString(2, current_date_time)&lt;/DIV&gt;&lt;DIV&gt;statement.setString(3, current_date_time)&lt;/DIV&gt;&lt;DIV&gt;if self.application_execution_context is None:&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;statement.setNull(4, JAVA_SQL_TYPE_STRING)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;else:&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;statement.setString(4, 
self.application_execution_context[0:100])&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;resultset = statement.executeQuery()&lt;/DIV&gt;&lt;DIV&gt;resultset.next()&lt;/DIV&gt;&lt;DIV&gt;application_execution_id = resultset.getInt(1)&lt;/DIV&gt;&lt;DIV&gt;connection.close()&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;# Execute stored procedure&lt;/DIV&gt;&lt;DIV&gt;connection = get_sqlserver_jdbc_connection(spark_session = spark_session, server_name = server_name, database_name = database_name)&lt;/DIV&gt;&lt;DIV&gt;sql = f"exec {stored_procedure_name}"&lt;/DIV&gt;&lt;DIV&gt;statement = connection.prepareStatement(sql)&lt;/DIV&gt;&lt;DIV&gt;statement.execute()&lt;/DIV&gt;&lt;DIV&gt;connection.close()&lt;/DIV&gt;</description>
      <pubDate>Fri, 02 Aug 2024 12:56:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/81646#M36375</guid>
      <dc:creator>rushi29</dc:creator>
      <dc:date>2024-08-02T12:56:22Z</dc:date>
    </item>
    <item>
      <title>Re: sparkContext in Runtime 15.3</title>
      <link>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/82263#M36587</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;for your response. Since I also need to call stored procedures in the Azure SQL databases from Azure Databricks, I don't think the DataFrames solution would work. When using py4j, how would I create a connection object in Azure Databricks? I tried various code samples online, but none of them worked. I was trying something similar to the code below. Am I missing anything?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Rushi&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;from&lt;/SPAN&gt;&lt;SPAN&gt; pyspark.sql &lt;/SPAN&gt;&lt;SPAN&gt;import&lt;/SPAN&gt;&lt;SPAN&gt; SparkSession&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;from&lt;/SPAN&gt;&lt;SPAN&gt; py4j.java_gateway &lt;/SPAN&gt;&lt;SPAN&gt;import&lt;/SPAN&gt;&lt;SPAN&gt; java_import, JavaGateway, GatewayParameters&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;spark &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; SparkSession.builder.&lt;/SPAN&gt;&lt;SPAN&gt;appName&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"AzureSQLStoredProcedure"&lt;/SPAN&gt;&lt;SPAN&gt;).&lt;/SPAN&gt;&lt;SPAN&gt;getOrCreate&lt;/SPAN&gt;&lt;SPAN&gt;()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;gateway &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;JavaGateway&lt;/SPAN&gt;&lt;SPAN&gt;()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt; (gateway.jvm.DriverManager)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;jdbc_url &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;"jdbc:sqlserver://xxx.database.windows.net:1433;databaseName=xxx;authentication=ActiveDirectoryMSI"&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;connection &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; 
gateway.jvm.DriverManager.&lt;/SPAN&gt;&lt;SPAN&gt;getConnection&lt;/SPAN&gt;&lt;SPAN&gt;(jdbc_url)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;print&lt;/SPAN&gt;&lt;SPAN&gt; (connection)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;This code gives an error&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;Py4JNetworkError: &lt;/SPAN&gt;An error occurred while trying to connect to the Java server (127.0.0.1:25333)&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV class=""&gt;File &lt;SPAN class=""&gt;/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:982&lt;/SPAN&gt;, in &lt;SPAN class=""&gt;GatewayClient._get_connection&lt;/SPAN&gt;&lt;SPAN class=""&gt;(self)&lt;/SPAN&gt; &lt;SPAN&gt;981&lt;/SPAN&gt; &lt;SPAN class=""&gt;try&lt;/SPAN&gt;: &lt;SPAN class=""&gt;--&amp;gt; 982&lt;/SPAN&gt; connection &lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;self&lt;/SPAN&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;deque&lt;SPAN&gt;.&lt;/SPAN&gt;pop() &lt;SPAN&gt;983&lt;/SPAN&gt; &lt;SPAN class=""&gt;except&lt;/SPAN&gt; &lt;SPAN class=""&gt;IndexError&lt;/SPAN&gt;:&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;HR /&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;File &lt;SPAN class=""&gt;/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1177&lt;/SPAN&gt;, in &lt;SPAN class=""&gt;GatewayConnection.start&lt;/SPAN&gt;&lt;SPAN class=""&gt;(self)&lt;/SPAN&gt; &lt;SPAN&gt;1174&lt;/SPAN&gt; msg &lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;An error occurred while trying to connect to the Java &lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;\ &lt;SPAN&gt;1175&lt;/SPAN&gt; &lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;server (&lt;/SPAN&gt;&lt;SPAN class=""&gt;{0}&lt;/SPAN&gt;&lt;SPAN&gt;:&lt;/SPAN&gt;&lt;SPAN class=""&gt;{1}&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;format(&lt;SPAN&gt;self&lt;/SPAN&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;address, 
&lt;SPAN&gt;self&lt;/SPAN&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;port) &lt;SPAN&gt;1176&lt;/SPAN&gt; logger&lt;SPAN&gt;.&lt;/SPAN&gt;exception(msg) &lt;SPAN class=""&gt;-&amp;gt; 1177&lt;/SPAN&gt; &lt;SPAN class=""&gt;raise&lt;/SPAN&gt; Py4JNetworkError(msg, e)&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 07 Aug 2024 16:17:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/82263#M36587</guid>
      <dc:creator>rushi29</dc:creator>
      <dc:date>2024-08-07T16:17:40Z</dc:date>
    </item>
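The Py4JNetworkError in the reply above is expected: calling `JavaGateway()` with no arguments tries to connect to a standalone py4j server on 127.0.0.1:25333, which Databricks never starts. On clusters where sparkContext is still exposed (single-user or no-isolation access modes), the usual workaround is to reuse the gateway Spark itself started rather than creating a new one. A minimal sketch, relying on the private `_gateway` attribute, which may change between runtimes and is deliberately unavailable on Unity Catalog shared access mode:

```python
def get_jdbc_connection(spark, jdbc_url):
    """Open a java.sql.Connection through the py4j gateway Spark already started.

    Works only where sparkContext is exposed (single-user / no-isolation
    clusters); on Unity Catalog shared access mode, _gateway is removed.
    """
    # Reuse Spark's own gateway instead of JavaGateway(), which expects a
    # standalone py4j server on 127.0.0.1:25333 that Databricks does not run.
    gateway = spark.sparkContext._gateway
    driver_manager = gateway.jvm.java.sql.DriverManager
    return driver_manager.getConnection(jdbc_url)
```

This is the same mechanism as the original `spark_session._sc._gateway` code, so it does not lift the shared-compute restriction; it only explains why a freshly constructed `JavaGateway()` cannot connect.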
    <item>
      <title>Re: sparkContext in Runtime 15.3</title>
      <link>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/91528#M38189</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/114576"&gt;@rushi29&lt;/a&gt;&amp;nbsp;, did you ever get a solution to this?&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;there never was a response to the issue there&lt;/P&gt;</description>
      <pubDate>Tue, 24 Sep 2024 07:43:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/91528#M38189</guid>
      <dc:creator>jayct</dc:creator>
      <dc:date>2024-09-24T07:43:57Z</dc:date>
    </item>
    <item>
      <title>Re: sparkContext in Runtime 15.3</title>
      <link>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/110591#M43614</link>
      <description>&lt;P&gt;Have you found any solutions to this problem?&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/114576"&gt;@rushi29&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/122601"&gt;@jayct&lt;/a&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Feb 2025 11:57:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/110591#M43614</guid>
      <dc:creator>Gangster</dc:creator>
      <dc:date>2025-02-19T11:57:55Z</dc:date>
    </item>
    <item>
      <title>Re: sparkContext in Runtime 15.3</title>
      <link>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/110595#M43615</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/122601"&gt;@jayct&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/149891"&gt;@Gangster&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Per the docs, "Shared Compute" in Unity Catalog does not support sparkContext, and this is still the case with the latest version. We ended up using Personal Compute clusters for each of our developers, and we run the jobs under "Job Compute" in production.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Feb 2025 12:26:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/110595#M43615</guid>
      <dc:creator>rushi29</dc:creator>
      <dc:date>2025-02-19T12:26:06Z</dc:date>
    </item>
    <item>
      <title>Re: sparkContext in Runtime 15.3</title>
      <link>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/110604#M43620</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/114576"&gt;@rushi29&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/149891"&gt;@Gangster&lt;/a&gt;&lt;/P&gt;&lt;P&gt;I ended up implementing pyodbc with the mssql driver, installed via init scripts.&lt;/P&gt;&lt;P&gt;Spark context is no longer usable on shared compute, so that was the only approach we could take.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Feb 2025 14:03:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/sparkcontext-in-runtime-15-3/m-p/110604#M43620</guid>
      <dc:creator>jayct</dc:creator>
      <dc:date>2025-02-19T14:03:54Z</dc:date>
    </item>
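The pyodbc route described in the last reply can be sketched as follows. This is an illustration under stated assumptions, not the poster's actual code: the "ODBC Driver 18 for SQL Server" name assumes msodbcsql18 was installed through a cluster init script, the managed-identity authentication mirrors the JDBC URL quoted earlier in the thread, and the table and column names are hypothetical placeholders. Unlike the DataFrame JDBC writer, this path can return the newly inserted identity value and can execute stored procedures:

```python
def build_conn_str(server, database, driver="ODBC Driver 18 for SQL Server"):
    # Connection string for Azure SQL with managed-identity auth. The driver
    # name assumes msodbcsql18 was installed cluster-wide via an init script.
    return (
        f"DRIVER={{{driver}}};"
        f"SERVER={server},1433;"
        f"DATABASE={database};"
        "Authentication=ActiveDirectoryMsi;"
        "Encrypt=yes;"
    )

def get_connection(server, database):
    import pyodbc  # available once the init script installs the driver
    return pyodbc.connect(build_conn_str(server, database))

def insert_and_get_identity(conn, application_id, context):
    # Hypothetical table/column names, for illustration only. OUTPUT INSERTED
    # returns the new identity value in the same round trip, which covers the
    # identity-value concern from the original post.
    sql = (
        "INSERT INTO dbo.application_execution (application_id, execution_context) "
        "OUTPUT INSERTED.application_execution_id VALUES (?, ?)"
    )
    cur = conn.cursor()
    cur.execute(sql, application_id, None if context is None else context[:100])
    return cur.fetchone()[0]

def run_stored_procedure(conn, name):
    # Stored procedures work the same way as on the old JDBC path.
    conn.cursor().execute(f"EXEC {name}")
    conn.commit()
```

Since pyodbc runs entirely on the driver node and never touches sparkContext, it works on shared access mode, which is presumably why it was the approach that remained available.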
  </channel>
</rss>

