How to solve a UDF performance issue with a Databricks SQL function?
08-14-2024 04:14 PM
Hi,
I am dealing with a situation where I need to secure data at rest in storage (Azure Data Lake), so I am saving the data as encrypted text in a Delta table. For serving, I want to create a dynamic view over the Delta table that decrypts the data on the fly, so that authorized users can see the decrypted values. I am storing the decryption key in Azure Key Vault, which is not directly accessible from Azure Databricks. I want to avoid the secret scope route, because those users would need read access to the scope and could then see the key. So I have created an obfuscated Python wheel that reads the decryption key, so that no one can see the code. The issue now is performance: 1> the SQL function context doesn't have a SparkSession, so I am forced to use plain Python code instead of a pandas UDF; 2> I can't create a global UDF that I can use from any notebook.
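Roughly, my current setup looks like this (all function, package, and table names are illustrative, not the real ones):

```python
# Rough sketch of the current approach (names are illustrative).
# The obfuscated wheel fetches the key from Key Vault internally,
# so callers never see it.
spark.sql("""
CREATE OR REPLACE FUNCTION catalog.schema.decrypt_text(ciphertext STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
  from my_secure_pkg import decrypt_value  # function from the wheel
  return decrypt_value(ciphertext)         # plain Python, one row at a time
$$
""")

# Dynamic view that decrypts on the fly for authorized readers.
spark.sql("""
CREATE OR REPLACE VIEW catalog.schema.v_customer AS
SELECT id, catalog.schema.decrypt_text(email_enc) AS email
FROM catalog.schema.customer_encrypted
""")
```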
1> Can anyone help me improve the performance of this UDF?
2> Is there any way #databricks could a> enable a SparkSession within SQL functions, and b> enable users to create a global UDF accessible from any notebook?
Many thanks
08-14-2024 11:03 PM
Have you tried registering a Python UDF for Databricks SQL?
These functions are defined and managed as part of Unity Catalog, providing strong, fine-grained governance, and they are available to all users with the appropriate permissions. You can call such a function just like any other built-in SQL function.
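For reference, a minimal sketch of registering and governing such a function (catalog, schema, function, and group names are illustrative):

```python
# Minimal sketch of a Unity Catalog Python UDF (names are illustrative).
# Run in a Databricks notebook, where `spark` is predefined.
spark.sql("""
CREATE OR REPLACE FUNCTION main.crypto.redact(s STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
  return '*' * len(s) if s else s
$$
""")

# Fine-grained governance: only principals granted EXECUTE can call it.
spark.sql("GRANT EXECUTE ON FUNCTION main.crypto.redact TO `analysts`")

# Callable like any built-in SQL function, from any notebook or warehouse:
spark.sql("SELECT main.crypto.redact('secret') AS redacted").show()
```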
Also, you could consider creating an Azure Key Vault-backed secret scope and granting MANAGE permission on that scope only to its creator.
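If you did go that route, a rough sketch with the Databricks Python SDK (assuming `databricks-sdk` is installed; the scope name, principal, and Key Vault resource values are placeholders):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import workspace

w = WorkspaceClient()  # picks up auth from the environment/notebook

# Create an Azure Key Vault-backed secret scope (placeholder values).
w.secrets.create_scope(
    scope="encryption-keys",
    scope_backend_type=workspace.ScopeBackendType.AZURE_KEYVAULT,
    backend_azure_keyvault=workspace.AzureKeyVaultSecretScopeMetadata(
        resource_id="/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<kv>",
        dns_name="https://<kv>.vault.azure.net/",
    ),
)

# Grant MANAGE only to the scope's creator; no other principal gets an
# ACL, so regular view readers cannot read the key through the scope.
w.secrets.put_acl(
    scope="encryption-keys",
    principal="creator@example.com",
    permission=workspace.AclPermission.MANAGE,
)
```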
08-15-2024 12:42 AM
Hi @menotron
Yes, as I mentioned: "1> The SQL function context doesn't have a SparkSession. I am forced to use plain Python code instead of a pandas UDF, as a SparkSession is not available within a SQL function." I have used the Python UDF for Databricks SQL, and it does not perform well because I can't use a pandas UDF there to improve performance; there is no SparkSession.
Also, I don't want to use a Key Vault- or Databricks-backed secret scope, because the users who need access to the view would also need read access to the secret scope. In that case a user could get the decryption key and share it with other users. That is a security issue.
Many thanks
08-15-2024 01:34 AM
I don't fully understand your problem, but I do know the following about UDFs:
1. Plain PySpark UDFs are extremely slow, because Spark has to deserialize each value from the JVM, run the Python function on it, and then serialize the result back. This happens on a row-by-row basis, making it extremely inefficient.
2. The next best approach in Python is a pandas UDF, which works on batches of data rather than row by row, making it much faster (see the sketch after this list).
3. Using Scala instead of Python avoids the serialize/deserialize round trip entirely, since Scala runs on the JVM. This would be the most efficient method.
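To illustrate point 2, a pandas UDF that decrypts in batches might look roughly like this (the Fernet cipher and the key lookup are stand-ins for whatever your wheel actually does; table and column names are illustrative):

```python
import os

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

def _load_key() -> bytes:
    # Hypothetical key lookup; in the original setup this would come
    # from the obfuscated wheel, not an environment variable.
    return os.environ["FERNET_KEY"].encode()

@pandas_udf(StringType())
def decrypt_batch(ciphertext: pd.Series) -> pd.Series:
    # Runs once per Arrow batch instead of once per row.
    from cryptography.fernet import Fernet  # stand-in cipher
    f = Fernet(_load_key())  # set-up cost amortized over the whole batch
    return ciphertext.map(lambda c: f.decrypt(c.encode()).decode())

# Usage:
df = spark.table("catalog.schema.customer_encrypted")
df.select(decrypt_batch("email_enc").alias("email")).show()
```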
08-15-2024 01:53 AM
Hi @pavlosskev
I am using a Databricks SQL function and wrapping the main code in a Python wheel to avoid exposing the code, and that is what is performing badly. I have tried the other options, as mentioned in my first post, but they have the limitations I described. So I am looking for a solution, maybe from the #databricks side, to enable them.

