Hi,
we are trying to move some of our code from a ‘legacy’ cluster to a ‘Multi-node/ Shared' cluster so that we can start using Unity Catalog. However, we have run into an issue with some of our code, which calls Stored Procedures, on the new cluster. I would appreciate any guidance the community can supply.
Our legacy configuration was a High Concurrency cluster (not configured with Table Access Control). On that Cluster we were using pyodbc to execute stored procedures to Read/ Write and Update tables in an Azure SQL database.
To do this we installed the unixodbc and pyodbc as part of the cluster startup init file (pretty much using the approach outlined in the answer in this thread https://stackoverflow.com/questions/54132249/how-to-install-pyodbc-in-databricks).
Once installed, we happily call a range of stored procedures on the Azure SQL database from the legacy cluster (as mentioned above the procs do all manner of reads, writes and updates).
Last week tried to move our code to a "Multi-node/Shared" cluster linked to Unity Catalog and we have run into an issue: namely we’ve been unable to find a way to execute our stored procedures on the new cluster.
Here is what we have tried and what we found:
Approach 1: Try using pyodbc and install the unix odbc drivers
Whilst the pyodbc library can be imported into the notebook on the new Cluster, when we try and run pyodbc we get the below message which indicates that the required odbc drivers are not available.
When we manually tried to run the install of the odbc drivers we are informed we do not have permissions on the new cluster (fyi - trying to driver install via Global Inits also does not appear to have been successful).
Our working understanding of why this is the case is that the “Multi-node / Shared” clusters are a more locked down configuration and does not allow root level modifications (such as installing new drivers). Does that sound right?
Approach 2: Switch to using Jdbc to run the Stored Procedures
Using our legacy Cluster we adapted the way we execute stored procedures and took pyodbc out of the picture. Instead we executed the stored procedures using jdbc. (The below thread provides an example of how we tried to do this https://stackoverflow.com/questions/60354376/how-to-execute-a-stored-procedure-in-azure-databricks-p...
Whilst we got this approach working on the our legacy cluster, on the "Multi-node / Shared" cluster we got the below error informing we have a package whitelisting issue.
py4j.security.Py4JSecurityException: Method public static java.sql.Connection java.sql.DriverManager.getConnection(java.lang.String,java.lang.String,java.lang.String) throws java.sql.SQLException is not whitelisted on class class java.sql.DriverManager
Approach 3: Use Jdbc with direct Inserts
Using the Multi-node/ Shared cluster we have successfully used jdbc to both read and write to the Azure SQL table however from what we can see there is not way to run an UPDATE using jdbc so this is not an ideal solution but is our active work-around.
Before accepting our fate and moving away from executing stored procedures (something I know many would recommend anyway but will incur quite a bit of effort), I wanted to check whether the community had any ideas?
Thanks in advance