topic Re: Correlated column exception in SQL UDF when using UDF parameters. in Data Engineering

Correlated column exception in SQL UDF when using UDF parameters.

Johan_Van_Noten — Mon, 06 Dec 2021 12:38:56 GMT

Environment

Azure Databricks 10.1, including Spark 3.2.0

Scenario

I want to retrieve the average of a series of values between two timestamps, using a SQL UDF.

The average is obviously just an example. In a real scenario, I would like to hide some additional querying complexity behind a SQL UDF.

Tried (and working)

%sql
SELECT avg(temperature) as averageTemperature 
    from oventemperatures 
    where ovenTimestamp
          between to_timestamp("1999-01-01") 
          and to_timestamp("2021-12-31")

And this seems possible in a UDF as well.

CREATE FUNCTION
averageTemperatureUDF(ovenID STRING, startTime TIMESTAMP, endTime TIMESTAMP) 
   RETURNS FLOAT
   READS SQL DATA SQL SECURITY DEFINER
   RETURN SELECT avg(ovenTemperature) as averageTemperature 
          from oventemperatures 
          where ovenTimestamp
                between to_timestamp("1999-01-01") 
                and to_timestamp("2021-12-31")

Tried (and failed)

When I want to use the UDF's parameters in the filter condition, the function definition fails.

CREATE FUNCTION
averageTemperatureUDF(ovenID STRING, startTime TIMESTAMP, endTime TIMESTAMP) 
   RETURNS FLOAT
   READS SQL DATA SQL SECURITY DEFINER
   RETURN SELECT avg(ovenTemperature) as averageTemperature 
          from oventemperatures 
          where ovenTimestamp
                between startTime
                and endTime

The error message complains on "correlated column".

Error in SQL statement: AnalysisException: 
Correlated column is not allowed in predicate 
(spark_catalog.default.oventemperatures.ovenTimestamp >= outer(averageTemperatureUDF.startTime))
(spark_catalog.default.oventemperatures.ovenTimestamp <= outer(averageTemperatureUDF.endTime)):
Aggregate [avg(ovenTemperature#275) AS averageTemperature#9299]
+- Filter ((ovenTimestamp#273 >= outer(startTime#9301)) AND (ovenTimestamp#273 <= outer(endTime#9302)))
   +- SubqueryAlias spark_catalog.default.oventemperatures
      +- Relation default.oventemperatures[ovenTimestamp#273,(...),ovenTemperature#275,(...)] 
	     JDBCRelation(OvenTemperatures) [numPartitions=1]

Question(s)

It seems that it is not accepted to use the UDF's parameters inside the expressions.

Is that a correct conclusion?
Any reason for this limitation?
Any alternatives to work around this?

Re: Correlated column exception in SQL UDF when using UDF parameters.

Anonymous — Mon, 06 Dec 2021 23:45:42 GMT

Hello there, @Johan Van Noten!

My name is Piper and I'm one of the moderators here. Welcome to the community and thanks for your questions. Let's give it a while longer to see how the community responds. We can circle back to this later if we need to. 🙂

Re: Correlated column exception in SQL UDF when using UDF parameters.

Anonymous — Tue, 28 Dec 2021 16:13:17 GMT

@Johan Van Noten - I'm sorry about the delay. Our SMEs are running behind. 😞

I'll bump this back to them. Hang in there.

Re: Correlated column exception in SQL UDF when using UDF parameters.

jose_gonzalez — Wed, 12 Jan 2022 00:54:40 GMT

Hi @Johan Van Noten ,

It seems like there is an issue with the columns that are part of your "where" clause. These columns are part of the UDF itself. I would like to recommend to the check the following docs

1) https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-function.html

2) https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-functions-udf-scalar.html

3) https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-functions-udf-aggregate.html

Hopefully, these link could be helpful to solve this issue.

Thank you.

Re: Correlated column exception in SQL UDF when using UDF parameters.

lav — Tue, 05 Jul 2022 05:21:42 GMT

HI @Johan Van Noten

Were you able to resolve this issue? I am stuck on a same issue. Please if you can guide me if you are able to resolve this issue.

Thank You.

Re: Correlated column exception in SQL UDF when using UDF parameters.

Johan_Van_Noten — Tue, 05 Jul 2022 07:09:18 GMT

Hi Iav

Unfortunately, I've not been able to solve this or work around it.

Based on @Jose Gonzalez ' reply, I've been reading the documents he pointed to and I concluded that this is just not possible in SparkSQL. It significantly disables the usability of UDFs, but I can understand why it doesn't work.

Johan

Re: Correlated column exception in SQL UDF when using UDF parameters.

Robert_Smith — Tue, 05 Jul 2022 09:16:19 GMT

This UDF takes two parameters startDate and tillDate, and returns an integer number. Create a udf in sql server with return type integer.

Re: Correlated column exception in SQL UDF when using UDF parameters.

Johan_Van_Noten — Tue, 05 Jul 2022 13:38:36 GMT

Hi Robert,

Thanks for your suggestion. The fact that I want to do this in SparkSQL is because there is no underlying SQLServer. Well, there might be, but often there isn't. We make use of the possibility of SparkSQL to abstract the underlying data stores (Parquet files, CSV files or SQL data) into a single SQL-alike layer. That is a kind of virtual curation.

In that context, I do not see an option to delegate the UDF to a lower SQL server. Am I overlooking something?

Best regards,

Johan

Re: Correlated column exception in SQL UDF when using UDF parameters.

lav — Tue, 05 Jul 2022 16:40:38 GMT

Hi @Johan Van Noten

I got a work around it. If this helps you. Below is the query I wrote

Target Query:

create or replace function TestAverage(DateBudget Date) RETURNS FLOAT

return select Avg(pd.Amount) as Amount

from TestTable1 as pd

left join TestTable2 er

on er.PK = pd.Exchange_Rate_PK

where sign(DATEDIFF(DateBudget, date_add(to_date(pd.From_Date), -1))) = 1

and sign(DATEDIFF(DateBudget, date_add(to_date(pd.To_Date), 1))) = -1;

Original Query:

select AVG(pd.Amount) as Amount

from TestTable1 as pd

left join TestTable2 er

on er.PK = pd.Exchange_Rate_PK

where fci.DATEBUDGET >= pd.From_Date

and fci.DATEBUDGET <= pd.To_Date

Re: Correlated column exception in SQL UDF when using UDF parameters.

lav — Tue, 05 Jul 2022 16:41:01 GMT

I have used sign function to determine the date diff.

Re: Correlated column exception in SQL UDF when using UDF parameters.

Raghu_Bindingan — Thu, 01 Dec 2022 19:57:54 GMT

Hi @Johan Van Noten ,

Were you able to work around this issue. The UDF does not mind an equivalent operator but does not like it when the predicate has a non equivalent operator like > or <.

Re: Correlated column exception in SQL UDF when using UDF parameters.

Johan_Van_Noten — Fri, 02 Dec 2022 08:39:55 GMT

Hi @Raghu Bindinganavale ,

I wasn't able to work around this and considered it a restriction of SparkSQL. We are coping with it on a higher layer now, which is unfortunate.

Thanks,

Johan

Re: Correlated column exception in SQL UDF when using UDF parameters.

Own — Fri, 02 Dec 2022 12:04:11 GMT

Iav's solution works.

Re: Correlated column exception in SQL UDF when using UDF parameters.

Own — Fri, 02 Dec 2022 12:12:39 GMT

In case of Azure Databricks you can leverage ADF and run this function using SQL Integration runtime while ingesting without having any dependency on lower SQL environment.

Re: Correlated column exception in SQL UDF when using UDF parameters.

Raghu_Bindingan — Fri, 02 Dec 2022 13:11:15 GMT

Hi @Johan Van Noten

I got it to work, its a very different kind of SQL then what i am used to but it works, you could try the below and tweak it to your needs. Basically spark sql does not like predicates that are not equal, so you will need to use a self join to the same table and create a column that equates to your conditions and then join that back to your main table to get the desired result. Very different from our daily SQL 😀

CREATE FUNCTION

averageTemperatureUDF(ovenID STRING, startTime TIMESTAMP, endTime TIMESTAMP)

RETURNS FLOAT

READS SQL DATA SQL SECURITY DEFINER

RETURN SELECT avg(ovenTemperature) as averageTemperature

from oventemperatures base

INNER JOIN (SELECT ovenTimestamp BETWEEN startTime AND endTime AS label, ovenTimestamp FROM oventemperatures) *****

ON base.ovenTimestamp = *****.ovenTimestamp

where *****.label IS true

Re: Correlated column exception in SQL UDF when using UDF parameters.

Johan_Van_Noten — Fri, 02 Dec 2022 14:14:07 GMT

Hi @Raghu Bindinganavale ,

Thanks for your suggestion.

I'm not sure how the SQL engine will optimize this query, but you are essentially calculating the condition-label over all oventemperatures in the full storage (millions) and only afterwards filtering the good from the bad ones in the where clause.

While the oventemperatures (virtual) table is partitioned on ovenTimestamp and should therefore be able to resolve the BETWEEN extremely quickly, I guess that your solution may not run with acceptable performance. Do you think the query optimizer will be able to work its way through this efficiently?

In my scenario description, I indicated that this is just a simplified example to clarify the issue. I guess though that in our general cases, isolating the condition in the way you propose may be possible. Although I remain sceptical about the performance. I'll need to try your proposal once I have the chance to refactor the application.

Best regards,

Johan

Re: Correlated column exception in SQL UDF when using UDF parameters.

Johan_Van_Noten — Fri, 02 Dec 2022 14:26:07 GMT

Thanks guys,

@Lav Chandaria and @Raghu Bindinganavale 's solutions both work, but as indicated in my reply above, I'm worried about the performance of evaluating the datediff (Lav) or the label (Raghu) over the full dataset without the engine being able to just cut away "half" of the potential outcomes based on the single < and information from an index / partition. (see my comment above to Raghu).

I'll try it on a significant dataset once I get the opportunity.

Johan

Re: Correlated column exception in SQL UDF when using UDF parameters.

Raghu_Bindingan — Fri, 02 Dec 2022 14:28:45 GMT

HI @Johan Van Noten

In the instance that i had which was quite simple it did perform ok, but you are right, about performance, this is something that needs to be monitored. I am sure SQL server would go crazy with this approach.

Thanks

Raghu

Re: Correlated column exception in SQL UDF when using UDF parameters.

Rhys — Tue, 24 Jan 2023 17:14:04 GMT

Hi Iav,

With sign I still get Correlated column bug, any thoughts?

CREATE FUNCTION IF NOT EXISTS rw_weekday_diff(start_date DATE, end_date DATE)

RETURNS INT

COMMENT 'Calculated days between two dates excluding weekend'

RETURN DATEDIFF(DAY, start_date, end_date)

- (DATEDIFF(WEEK, start_date, end_date) * 2)

- (CASE WHEN dayofweek(start_date) = 1 THEN 1 ELSE 0 END)

+ (CASE WHEN dayofweek(start_date) = 7 THEN 1 ELSE 0 END)

- (SELECT COUNT(*) FROM bank_holidays AS h WHERE

sign(DATEDIFF(h.date, date_add(to_date(start_date), -1))) = 1 AND

sign(DATEDIFF(h.date, date_add(to_date(end_date), 1))) = -1);

-- original where clause - (SELECT COUNT(*) FROM bank_holidays.uk_bank_holidays AS h WHERE h.date >= start_date AND h.date = end_date);

Re: Correlated column exception in SQL UDF when using UDF parameters.

creastysomp — Fri, 27 Jan 2023 10:52:23 GMT

Thanks for your suggestion. The fact that I want to do this in SparkSQL is because there is no underlying SQLServer.