01-05-2024 05:10 AM
Spark 3.4 introduced parameterized SQL queries and Databricks also discussed this new functionality in a recent blog post (https://www.databricks.com/blog/parameterized-queries-pyspark)
Problem: I cannot run any of the examples provided in the PySpark documentation: (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.s...)
I literally just copy-pasted the examples into a notebook and tried running them, but they failed with an error (I'm running DBR 14.1).
If I use regular f-formatted strings it works:
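For illustration, this is roughly what I mean (a minimal sketch with made-up bounds; the parameterized call is the one that fails for me, while the f-string version runs fine):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parameterized form from the PySpark docs / Databricks blog post -- this is the
# variant that errors out for me on DBR 14.1 (the bounds are just example values):
df_param = spark.sql(
    "SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}",
    bound1=7,
    bound2=9,
)

# f-string workaround -- runs fine, but defeats the purpose, because interpolating
# untrusted values this way is a SQL injection risk:
bound1, bound2 = 7, 9
df_fstring = spark.sql(f"SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}")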
Regardless of which example from the PySpark docs or the Databricks blog post I tried, they all resulted in the same error. It would only work when I used f-formatted strings. But the whole idea of this new functionality is that we no longer have to use f-formatted strings, since they can present a SQL injection vulnerability.
Did anyone get this new parameterized SQL functionality to work? Or am I missing something here?
01-09-2024 08:29 AM
Hi @Retired_mod ,
thank you for your quick answer.
So DBR 14.1 actually includes Spark 3.5.0. I will test with DBR 13.3 LTS, however, and see if that solves the problem. Maybe the issue is caused by Spark Connect, which was introduced for shared clusters in DBR 14.0. Thanks for the hint.
01-20-2024 03:12 PM
Hi @Retired_mod
I just ran a couple of tests with the parameterized spark.sql() query examples from the documentation (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.s...), with the following results:
Parameterized spark.sql() works for:
- single user clusters with DBR 13.3 LTS, 14.0 and 14.2
- shared access mode clusters with DBR 13.3 LTS
Parameterized spark.sql() does NOT work for:
- shared access mode clusters with DBR 14.0 and 14.2
So it seems as if shared access mode does not fully support parameterized queries yet. If I remember correctly, shared access mode clusters introduced Spark Connect starting with DBR 14.0, so I assume that this is the issue. Are there plans to support parameterized spark.sql() for shared access mode clusters with DBR > 13.3 in the future?
05-10-2024 12:43 AM
Thanks for the clarification @Michael_Appiah, very helpful! Is there already a timeline for when this will be supported in DBR 14.x? The alternatives are not SQL-injection-proof enough for us.
05-15-2024 03:45 AM
@Cas Unfortunately I do not have any information on this. However, I have seen that DBR 14.3 and 15.0 introduced some changes to spark.sql(). I have not checked whether those changes resolve the issue outlined here. Your best bet is probably to try DBR 15.1 (or 15.2, which is in Beta). Maybe @Retired_mod has more information on any future plans to support parameterized spark.sql() for shared access mode clusters with DBR > 13.3?
09-18-2024 08:51 AM
Can confirm it's working again, tested on a job cluster with DBR 15.4 LTS. It failed on 14.3 LTS.
01-09-2025 05:49 AM
Did anyone get parameterized spark.sql() to work with Delta Live Tables? Using DLT pipelines, I get the error:
"TypeError: _override_spark_functions.<locals>._dlt_sql_fn() got an unexpected keyword argument 'bound1'"
I checked that the cluster the pipeline is using has Runtime 15.4.6, which should be running Spark 3.5.0, as stated here: https://docs.databricks.com/en/release-notes/runtime/index.html
I can't find any source that says this functionality is not available for Delta Live Tables.
01-09-2025 05:53 AM
Hi @alex0sp,
The error message you are encountering, TypeError: _override_spark_functions.<locals>._dlt_sql_fn() got an unexpected keyword argument 'bound1', suggests that there might be an issue with how the parameters are being passed to the spark.sql() function within the DLT pipeline. Which parameters are you passing to the function?
01-09-2025 06:19 AM
Hello Alberto,
thanks for the quick answer! Actually I want to pass a DataFrame to the function, like:
import dlt

@dlt.table(
    name="test"
)
def create_table():
    test_df = spark.createDataFrame(["9", "10", "11", "13"], "string").toDF("id")
    final_df = spark.sql("SELECT * FROM {df} WHERE id > 9", df=test_df)
    return final_df
This fails with: TypeError: _override_spark_functions.<locals>._dlt_sql_fn() got an unexpected keyword argument 'df'
And I get the same error when trying to pass only an integer for testing purposes, like:
@dlt.table(
    name="test"
)
def create_table():
    test_df = spark.sql("SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}", bound1=7, bound2=9)
    return test_df
This fails with the analogous error: TypeError: _override_spark_functions.<locals>._dlt_sql_fn() got an unexpected keyword argument 'bound1'
Both spark.sql() statements work outside a DLT pipeline, so it seems like this is a DLT-specific issue.
01-09-2025 06:25 AM
This notation is also not supported in Serverless. If you want to share the problem you are trying to solve, there is probably another solution.
01-09-2025 06:44 AM
The problem I want to solve:
On-prem I have a ton of complicated SQL code in my gold layer that uses temp tables for intermediate results. No way around that.
I want to migrate the gold layer to DLT. I thought the best way to do this is to use parameterized spark.sql() instead of creating a temporary view every time for those intermediate results.
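Roughly the pattern I have in mind (a minimal sketch with made-up data; the parameterized call is what I would like to use, the temp-view version is the fallback):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for a silver-layer table.
orders_df = spark.createDataFrame([(1, 10.0), (2, -5.0)], "id INT, amount DOUBLE")

# What I would like to write (parameterized spark.sql(); this is what fails inside DLT):
intermediate_df = spark.sql("SELECT * FROM {src} WHERE amount > 0", src=orders_df)

# The temporary-view fallback I would otherwise need for every intermediate result:
orders_df.createOrReplaceTempView("orders_tmp")
intermediate_df = spark.sql("SELECT * FROM orders_tmp WHERE amount > 0")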
Also, I'm putting each table's code into its own .py file and importing them inside the DLT notebook, so the notebook doesn't blow up into thousands of lines of code.
Is there a better approach for complicated gold-layer logic in DLT?
01-09-2025 07:36 AM
How I would approach the problem:
1. Do you really need to migrate your logic from on-prem to the cloud? If not, use query federation to query your "gold" on-prem (DW?); once the on-prem DW has reached end of life, there is no need for a complex and potentially wrong re-implementation of the SQL.
2. If 1. is not possible and you still want to run the same SQL in the cloud (assuming the SQL is ANSI-compatible), I would put all the SQL queries in a config file as text strings, replace the temp-table references with proper DLT notation (assuming T-SQL, e.g. #mytemp1 becomes live.mytemp1), and then use a Python script to produce the DLT pipeline by looping through the config (DLT will figure out the order and graph by itself); see the sketch after this list. I would not recommend this, for the simple reason that it will prevent a lot of optimizations (partitioning, clustering, watermarking, streaming etc.).
3. If you have a lot of "gold" code on-prem, I'm assuming you are talking about a legacy enterprise data model (probably not Data Vault, though), in which case the best solution is, if the data sources remain the same, to take a domain-based approach to building the DW as a lakehouse in the cloud with a medallion architecture (assuming you have the first layers in the cloud as well, or are they on-prem?). Once you have bronze and silver tables (usually these can be easily automated with configuration files), you can build the gold layer as data domains, which means dividing the tons of complicated on-prem code into manageable chunks which you can independently build specifically for the cloud. Make one or more DLT pipelines for each domain and then use jobs if the DLT pipelines have interdependencies. At some point, Delta table triggers are coming to Databricks as well, in addition to schedule and file triggers, so you'll be able to update downstream tables as soon as an upstream table update is completed. Or you can use the REST API to produce the job that maintains the interdependencies, which also saves you a lot of clicks.
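For option 2, a minimal sketch of what I mean by looping through a config (table names and SQL strings are made up, and I'm assuming the usual live.<name> resolution also applies to the plain one-argument spark.sql() call inside the pipeline):

import dlt

# Hypothetical config: target table name -> SQL text, with the on-prem temp-table
# references already rewritten to live.<name> notation.
queries = {
    "mytemp1": "SELECT id, amount FROM live.silver_orders WHERE amount > 0",
    "gold_orders": "SELECT id, SUM(amount) AS total FROM live.mytemp1 GROUP BY id",
}

def make_table(table_name, sql_text):
    # Closure so each generated table keeps its own name and query.
    # `spark` is provided by the DLT runtime.
    @dlt.table(name=table_name)
    def _tbl():
        return spark.sql(sql_text)
    return _tbl

for table_name, sql_text in queries.items():
    make_table(table_name, sql_text)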
01-10-2025 02:06 AM
Hey adriennn, thanks for the long answer.
So 1 is just not an option; the goal is a full migration.
Option 2 would mean that all those temporary tables/views I create for a gold table would become permanent through the use of LIVE.temp1, wouldn't they?
Option 3 is basically what I am trying to do: migrating an on-prem DWH into a Databricks lakehouse. The bronze and silver layers are already successfully migrated into Databricks. I have also already translated all the gold T-SQL to Spark SQL and split the domains/data sources into their own DLT pipelines.
Now my best guess (I am new to Databricks) for the gold layer is to save that Spark SQL code (with temporary views instead of parameterized spark.sql()) into .py files for each table and import them in the DLT pipeline notebook. Do I lose any DLT features with this approach?
Maybe one day I can replace the temporary views with parameterized spark.sql(), when it becomes available to serverless clusters? Or is it possible to use dedicated clusters for DLT?
01-13-2025 02:34 AM
Option 2 can be done with TEMPORARY LIVE VIEWs (or a TEMPORARY STREAMING TABLE) over a Unity Catalog table, so not "permanent", I guess.
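In the Python API the rough equivalent is @dlt.view, which stays pipeline-scoped and is not published to the catalog. A minimal sketch (the source table and column names are made up):

import dlt
from pyspark.sql import functions as F

# `spark` is provided by the DLT runtime. "my_catalog.silver.orders" is a
# made-up Unity Catalog source table standing in for your silver layer.
@dlt.view(name="orders_filtered")
def orders_filtered():
    # Pipeline-local intermediate result, playing the role of an on-prem #temp table.
    return spark.read.table("my_catalog.silver.orders").where("amount > 0")

@dlt.table(name="gold_orders")
def gold_orders():
    # Downstream gold table reading the temporary view within the same pipeline.
    return dlt.read("orders_filtered").groupBy("id").agg(F.sum("amount").alias("total"))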
> for the gold layer is to save that spark SQL code into .py files for each table and import them in the DLT pipeline notebook. Do I lose any DLT features with this approach?
DLT does what you tell it to do; it doesn't necessarily care how you tell it. So whether your SQL code is brought into the pipeline by a Python script, by *.sql files (you could even have a single huge *.sql file), or by SQL cells in a notebook, the outcome should be the same. Note that the Python and SQL APIs of DLT are not at feature parity (some features are currently only available in Python, afaik).
> Maybe one day I can replace the temporary views with parameterized spark.sql(), when it becomes available to serverless clusters? Or is it possible to use dedicated clusters for DLT?
I think you mean "job cluster"? Nobody should be running DLT (you can't) or any job on personal clusters if that's what you mean.
Also, I'm not entirely sure what you are parametrizing in your views/temp tables, but you could also produce the parameterized SQL code using f-strings in Python, as in the sketch below. Or perhaps you can even leverage forEachBatch (and thus not necessarily DLT!) to run different logic on the same incoming/updated data. If you really insist on DLT/serverless, you need to think outside of the "dynamic SQL" approach in Databricks.
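A minimal sketch of the f-string idea (the threshold here is a value the pipeline author controls, not end-user input, which keeps the injection risk manageable; names and values are made up):

import dlt

# Hypothetical, author-controlled config value (not end-user input).
MIN_ID = 7

@dlt.table(name="gold_filtered")
def gold_filtered():
    # The SQL text is assembled with an f-string before the call, so only the
    # plain one-argument spark.sql() is used -- no keyword parameters involved.
    return spark.sql(f"SELECT * FROM range(10) WHERE id > {MIN_ID}")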
01-13-2025 06:13 AM
Thank you for the insights. I will probably be using TEMPORARY STREAMING LIVE VIEW then.
I am parametrizing spark dataframes, so I can do something like this:
spark.sql("SELECT * FROM {df}", df=my_df)