Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to solve a UDF performance issue with a Databricks SQL function?

RamenBhar
New Contributor II

Hi,

I am dealing with a situation where I need to secure data at rest on storage (Azure Data Lake), so I am saving the data as encrypted text in a Delta table. For serving, I want to create a dynamic view over the Delta table that decrypts the data on the fly, so that only the required users can see the decrypted values. I am storing the decryption key in Azure Key Vault, which is not accessible directly from Azure Databricks. I want to avoid the secret scope route, because those users would need read access to the scope and could then see the key. Hence I have created a Python wheel that reads the decryption key; the wheel is obfuscated so that no one can see the code. The issues now are performance related: 1> the SQL function context doesn't have a SparkSession, so I am forced to use plain Python code instead of a pandas UDF; 2> I can't create a global UDF that I can use from any notebook.
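For reference, this is roughly the pattern I have today (a simplified sketch; the catalog, table, function, and wheel names are placeholders, not my real code):

# Simplified sketch of the current setup (placeholder names, run from a Databricks notebook).
# The Unity Catalog SQL function below runs plain Python per value, because no
# SparkSession (and therefore no pandas UDF) is available inside its body.
spark.sql("""
CREATE OR REPLACE FUNCTION main.secure.decrypt_value(cipher_text STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
from my_decrypt_lib import load_key, decrypt  # placeholder for the obfuscated wheel
return decrypt(cipher_text, load_key()) if cipher_text else None  # called once per row
$$
""")

# Dynamic view that decrypts on the fly, only for members of an authorised group.
spark.sql("""
CREATE OR REPLACE VIEW main.secure.customer_v AS
SELECT id,
       CASE WHEN is_account_group_member('pii_readers')
            THEN main.secure.decrypt_value(email_encrypted)
            ELSE '***' END AS email
FROM main.secure.customer_encrypted
""")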

1> Can anyone help with how to improve the performance of the UDF?

2> Is there any way #databricks can a> enable SparkSession within SQL functions, and b> enable users to create a global UDF so that it is accessible from any notebook?

Many thanks

 

 

4 REPLIES

menotron
Valued Contributor

Have you tried registering a Python UDF for Databricks SQL?
These functions are defined and managed as part of Unity Catalog, providing strong and fine-grained governance, and they are available to all users with the appropriate permissions. You can call such a function just as you would any other built-in SQL function.
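A minimal sketch of what that looks like (the catalog, schema, function, and group names below are just examples, run from a notebook attached to a UC-enabled cluster):

# Register a Python UDF governed by Unity Catalog (example names).
spark.sql("""
CREATE OR REPLACE FUNCTION main.secure.mask_last4(v STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
return ('*' * max(len(v) - 4, 0)) + v[-4:] if v else None
$$
""")

# Grant EXECUTE so other users (or a group) can call it from any notebook or query.
# The group also needs USE CATALOG / USE SCHEMA on the parent objects.
spark.sql("GRANT EXECUTE ON FUNCTION main.secure.mask_last4 TO `data_consumers`")

# The function can then be called like any built-in SQL function.
spark.sql("SELECT main.secure.mask_last4('1234567890') AS masked").show()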

Also, you could consider creating an Azure Key Vault-backed secret scope and granting MANAGE permission on that secret scope only to its creator.
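For example, something along these lines with the Databricks SDK for Python (a sketch only; I am assuming the secrets ACL methods put_acl/delete_acl and the AclPermission enum, and the scope and principal names are placeholders, so please verify the exact signatures against the SDK docs):

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import workspace

w = WorkspaceClient()

# Keep MANAGE only for the creator of the (already created) Key Vault-backed scope.
w.secrets.put_acl(scope="kv-decrypt-scope",
                  principal="scope.owner@example.com",
                  permission=workspace.AclPermission.MANAGE)

# Remove the broad ACL on the `users` group, if the scope was created with it,
# so regular users cannot read the key directly.
w.secrets.delete_acl(scope="kv-decrypt-scope", principal="users")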

RamenBhar
New Contributor II

Hi @menotron 

Yes, as I mentioned, the SQL function context doesn't have a SparkSession, so I am forced to use plain Python code instead of a pandas UDF. I have already used a Python UDF for Databricks SQL, and it is not performing well because I can't use a pandas UDF there to improve the performance; there is no SparkSession available.

Also, I don't want to use a Key Vault-backed or Databricks-backed secret scope, because the users who need access to the view would also need read access to the secret scope. In that case a user could retrieve the decryption key and share it with others, which is a security issue.

 

Many thanks

 

pavlosskev
New Contributor III

I don't fully understand your problem, but I do know the following regarding UDFs:

1. Plain PySpark UDFs are extremely slow, because Spark has to serialize each value from the JVM to a Python worker, apply the Python UDF, and then serialize the result back. This happens on a row-by-row basis, which makes it very inefficient.

2. The second-best approach for Python is a pandas UDF, since it works on batches of data rather than row by row, which makes it much faster (a rough sketch follows at the end of this list).

3. Using Scala instead of Python avoids the issue entirely, since Scala runs on the JVM and no serialization to a separate Python process is needed. This would be the most efficient method.
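As a rough illustration of the difference between 1 and 2 (my_decrypt below is just a stand-in for whatever per-value function you actually call; run from a Databricks notebook where spark is available):

from pyspark.sql.functions import udf, pandas_udf, col
from pyspark.sql.types import StringType
import pandas as pd

def my_decrypt(v):                      # stand-in for the real per-value decryption
    return v[::-1] if v is not None else None

# 1. Plain Python UDF: invoked once per row, with data shipped between the JVM
#    and the Python worker for every single value.
row_udf = udf(my_decrypt, StringType())

# 2. pandas UDF: values arrive as Arrow-backed pandas Series, so the transfer
#    between the JVM and Python happens in batches instead of per row.
@pandas_udf(StringType())
def batch_udf(values: pd.Series) -> pd.Series:
    return values.apply(my_decrypt)     # or a fully vectorized operation if possible

df = spark.createDataFrame([("abc",), ("def",)], ["payload"])
df.select(row_udf(col("payload")).alias("slow"),
          batch_udf(col("payload")).alias("faster")).show()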

RamenBhar
New Contributor II

Hi @pavlosskev 

I am using a Databricks SQL function and wrapping the main code in a Python wheel to avoid exposing the code, and that is what performs badly. I have tried the other options, as I mentioned in my first post, but they have the limitations I described. So I am looking for a solution, maybe from the #databricks side, to enable them.
