I don't fully understand your problem, but I can say the following about UDFs:
1. Plain PySpark UDFs are slow because each row has to be serialized from the JVM to a Python worker process, transformed by the Python function, and then serialized back. Because this round trip happens row by row, it is extremely inefficient.
2. The next best approach in Python is a pandas UDF: data moves between the JVM and Python as Arrow batches, and your function is applied to whole batches (as pandas Series) rather than single rows, which makes it much faster.
3. Writing the UDF in Scala avoids the problem entirely: Scala runs on the JVM alongside Spark itself, so no serialization to and from a Python process is needed. This is the most efficient option.