Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

sparkContext in Runtime 15.3

rushi29
New Contributor II
Hello All,
 
Our Azure Databricks cluster runs under the "Legacy Shared Compute" policy with the 15.3 runtime. One of our Python notebooks connects to an Azure SQL database to read and insert data. The snippet below runs queries to insert/update data in the Azure SQL database and to execute stored procedures, and all of this works without any issues. However, we have now upgraded our environment to Unity Catalog and want to start using it instead of the hive_metastore. To write data to a Unity Catalog catalog, the cluster must run under the "Shared Compute" policy rather than "Legacy Shared Compute". Unfortunately, running the cluster in this mode is a problem because, per the documentation, sparkContext is not supported on this cluster type from runtime 14.0 onward.
 
 
So, the following line of code errors out since _sc is not available.
 
driver_manager = spark_session._sc._gateway.jvm.java.sql.DriverManager
 
I have looked through the documentation but haven't found a replacement for this code that can run on a UC-enabled cluster. I could use a DataFrame to insert the data into Azure SQL, but that becomes tricky when I need something back, e.g. the newly inserted identity value from that operation. The DataFrame approach also raises additional concerns around code structure, reusability, etc.
 
I wanted to know if there is a different approach to achieve the features below on UC-enabled clusters running the 15.3 runtime or above. Also, are there any plans to support sparkContext in future runtime versions? If so, I can simply wait for the supported runtime to be released; if not, I will need to find an alternative.
 
I appreciate any help in this matter.
 
def get_sqlserver_jdbc_connection(spark_session, server_name, database_name):
    driver_manager = spark_session._sc._gateway.jvm.java.sql.DriverManager
    jdbc_url = "MY DB URL"
    connection = driver_manager.getConnection(jdbc_url)
 
    return connection
 
# Execute SQL Query with parameters
connection = get_sqlserver_jdbc_connection(spark_session = spark_session, server_name = server_name, database_name = database_name)
statement = connection.prepareStatement(sql)
 
statement.setInt(1, int(self.application_id))
statement.setString(2, current_date_time)
statement.setString(3, current_date_time)
if self.application_execution_context is None:
    statement.setNull(4, JAVA_SQL_TYPE_STRING)
else:
    statement.setString(4, self.application_execution_context[0:100])
 
resultset = statement.executeQuery()
resultset.next()
application_execution_id = resultset.getInt(1)
connection.close()
 
# Execute stored procedure
connection = get_sqlserver_jdbc_connection(spark_session = spark_session, server_name = server_name, database_name = database_name)
sql = f"exec {stored_procedure_name}"
statement = connection.prepareStatement(sql)
statement.execute()
connection.close()
2 REPLIES

Kaniz_Fatma
Community Manager

Hi @rushi29, since sparkContext is unavailable in Databricks Runtime 14.0 and above, you can use the py4j library to interact with the JVM directly for JDBC connections. Alternatively, you can use DataFrames to write the data and read back any returned values (SQL Server's OUTPUT clause plays the role of a RETURNING clause there). Currently, Databricks doesn't plan to reintroduce sparkContext support, so it's recommended to use the spark variable (the SparkSession). For best practices with Unity Catalog, check the official documentation.
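
As a rough illustration of the DataFrame route on a Shared Compute cluster (the URL, table, and credentials below are only placeholders, not taken from your environment), a write and a read-back could look like this:

# Illustrative only: URL, table and credentials are placeholders.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>"

# Build a small DataFrame and append it to an existing SQL Server table.
df = spark.createDataFrame([(1, "2024-01-01 00:00:00")], ["application_id", "created_at"])
(df.write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.application_log")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())

# Read a value back, e.g. the latest identity value, with a pushed-down query.
latest_id = (spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query", "SELECT MAX(application_execution_id) AS id FROM dbo.application_log")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
    .collect()[0]["id"])

Note that reading back a MAX() is not the same as getting the identity generated by one specific insert, which is exactly the limitation you raised.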

 

rushi29
New Contributor II

Thanks @Kaniz_Fatma for your response. Since I also need to call stored procedures in the Azure SQL databases from Azure Databricks, I don't think the DataFrame solution will work. When using py4j, how would I create a connection object in Azure Databricks? I tried various code samples online, but none of them worked. I was trying something similar to the code below. Am I missing anything?

Thanks,

Rushi

from pyspark.sql import SparkSession
from py4j.java_gateway import java_import, JavaGateway, GatewayParameters

spark = SparkSession.builder.appName("AzureSQLStoredProcedure").getOrCreate()

# JavaGateway() with no arguments tries to connect to a py4j gateway server on the default port
gateway = JavaGateway()
print(gateway.jvm.DriverManager)

jdbc_url = "jdbc:sqlserver://xxx.database.windows.net:1433;databaseName=xxx;authentication=ActiveDirectoryMSI"
connection = gateway.jvm.DriverManager.getConnection(jdbc_url)
print(connection)
 
This code gives the following error:

Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:25333)

File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:982, in GatewayClient._get_connection(self)
    981 try:
--> 982     connection = self.deque.pop()
    983 except IndexError:

File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1177, in GatewayConnection.start(self)
   1174 msg = "An error occurred while trying to connect to the Java "\
   1175       "server ({0}:{1})".format(self.address, self.port)
   1176 logger.exception(msg)
-> 1177 raise Py4JNetworkError(msg, e)
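
For completeness, one JVM-free route that avoids py4j entirely (not mentioned earlier in this thread, so treat it purely as a sketch) is to talk to Azure SQL from Python with pyodbc. The table, column, and credential names below are made up, and it assumes the Microsoft ODBC Driver for SQL Server is installed on the cluster and that SQL authentication is used; adjust the connection string for Azure AD / managed identity as needed.

import pyodbc

# Placeholder values standing in for the real application context.
application_id = 123
current_date_time = "2024-01-01 00:00:00"

# Illustrative connection string; driver version and authentication are assumptions.
conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=xxx.database.windows.net,1433;"
    "DATABASE=xxx;"
    "UID=<user>;PWD=<password>"
)

connection = pyodbc.connect(conn_str)
cursor = connection.cursor()

# Parameterised INSERT that returns the new identity value via OUTPUT.
cursor.execute(
    "INSERT INTO dbo.application_log (application_id, created_at) "
    "OUTPUT INSERTED.application_execution_id "
    "VALUES (?, ?)",
    application_id, current_date_time,
)
application_execution_id = cursor.fetchone()[0]

# Stored procedures can be executed through the same cursor.
cursor.execute("EXEC dbo.my_stored_procedure")

connection.commit()
connection.close()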

 
