cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

sparkContext in Runtime 15.3

rushi29
New Contributor III
Hello All,
 
Our Azure databricks cluster is running under "Legacy Shared Compute" policy with 15.3 runtime. One of the python notebooks is used to connect to an Azure SQL database to read/insert data. The following snippet of code is responsible for running queries to insert/update data in Azure SQL database and to execute stored procedures. All of this works without any issues. However, we have now upgraded our environment to Unity Catalog and we want to start using the Unity Catalog instead of the hive_metastore. To write data to Unity Catalog catalog, the cluster must be running in "Shared Compute" policy and not "Legacy Shared Compute". Unfortunately, running the cluster in this mode seems an issue because as per the documentation, sparkContext is not supported for this cluster type starting with runtime 14.0 and above. 
 
 
So, the following line of code errors out since _sc is not available.
 
driver_manager = spark_session._sc._gateway.jvm.java.sql.DriverManager
 
I have looked around the documentation but haven't seen anything to replace this code so it can run inside a UC enabled cluster. I could use a dataframe to insert the data into Azure SQL but that becomes tricky when I want to return something back, e.g. the newly inserted identity value from that operation. Also, there are additional concerns with the dataframe approach like the code structure, difficulty in reusing, etc.
 
I wanted to know if there is a different approach to achieve the below features in UC enabled clusters running on 15.3 runtime or above. Also, are there any plans of supporting sparkContext in future versions of runtime. If so, I can just wait for the supported runtime to be released. If there are no plans, then I will need to find an alternate way.
 
I appreciate any help in this matter.
 
def get_sqlserver_jdbc_connection(spark_session, server_name, database_name):
    driver_manager = spark_session._sc._gateway.jvm.java.sql.DriverManager
    jdbc_url = "MY DB URL"
    connection = driver_manager.getConnection(jdbc_url)
 
    return connection
 
# Execute SQL Query with parameters
connection = get_sqlserver_jdbc_connection(spark_session = spark_session, server_name = server_name, database_name = database_name)
statement = connection.prepareStatement(sql)
 
statement.setInt(1, int(self.application_id))
statement.setString(2, current_date_time)
statement.setString(3, current_date_time)
if self.application_execution_context is None:
statement.setNull(4, JAVA_SQL_TYPE_STRING)
else:
statement.setString(4, self.application_execution_context[0:100])
 
resultset = statement.executeQuery()
resultset.next()
application_execution_id = resultset.getInt(1)
connection.close()
 
# Execute stored procedure
connection = get_sqlserver_jdbc_connection(spark_session = spark_session, server_name = server_name, database_name = database_name)
sql = f"exec {stored_procedure_name}"
statement = connection.prepareStatement(sql)
statement.execute()
connection.close()
2 REPLIES 2

rushi29
New Contributor III

Thanks @Retired_mod for your response. Since, I also need to call stored procedures in the Azure SQL databases from Azure Databricks, I don't think the DataFrames solution would work. When using py4j, how would I create a connection object in Azure Databricks? I tried with various sample code online but none of them worked. I was trying something similar as below. Am I missing anything?

Thanks,

Rushi

from pyspark.sql import SparkSession
from py4j.java_gateway import java_import, JavaGateway, GatewayParameters

spark = SparkSession.builder.appName("AzureSQLStoredProcedure").getOrCreate()
gateway = JavaGateway()
print (gateway.jvm.DriverManager)
jdbc_url = "jdbc:sqlserver://xxx.database.windows.net:1433;databaseName=xxx;authentication=ActiveDirectoryMSI"
connection = gateway.jvm.DriverManager.getConnection(jdbc_url)
print (connection)
 
This code gives an error
Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:25333)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:982, in GatewayClient._get_connection(self) 981 try: --> 982 connection = self.deque.pop() 983 except IndexError:

File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1177, in GatewayConnection.start(self) 1174 msg = "An error occurred while trying to connect to the Java "\ 1175 "server ({0}:{1})".format(self.address, self.port) 1176 logger.exception(msg) -> 1177 raise Py4JNetworkError(msg, e)

 

jayct
New Contributor II

Hi @rushi29 , did you ever get a solution to this?

@Retired_mod there never was a response to the issue there

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group