cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Extract SQL function in SQL Server federated database

Alessio_F
New Contributor

Hi everyone,

I'm using Azure Databricks with a customer who has a SQL Server database federated on the Unity Catalog.

It seems that, while converting some date functions to the SQL Server dialect, Databricks uses the function "extract", which is not recognized by SQL Server.

For example, this query in a SQL cell:

 
WITH cte AS (
  SELECT
    CAST(year(`Date`) AS int) AS RefYear,
    CAST(year(`Date`) AS int) AS RefMonth,
    COUNT(*) AS TabACount
  FROM tabA
  GROUP BY 1, 2
)
SELECT
  CAST(year(tabB.`Date`) AS int) AS RefYear,
  CAST(month(tabB.`Date`) AS int) AS RefMonth,
  MAX(cte.TabACount) AS TabACount,
  COUNT(*) AS TabBCount
FROM tabB
JOIN cte
  ON CAST(year(tabB.`Date`) AS int) = cte.RefYear
 AND CAST(month(tabB.`Date`) AS int) = cte.RefMonth
GROUP BY 1, 2
ORDER BY 1, 2
 
Let's ignore the fact that this query is not well written and there are functions in joins: I've simplified the original query to avoid to publicly reveal details of the customer's database.
The point is that, using "year" and "month" functions in joins gives the following error:
 
com.microsoft.sqlserver.jdbc.SQLServerException: 'EXTRACT' is not a recognized built-in function name. at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:265) at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1676) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:620) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:540) at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7627) at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3916) at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:268) at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:242) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:459) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:350) at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406) at org.apache.spark.rdd.RDD.iterator(RDD.scala:373) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60) at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406) at org.apache.spark.rdd.RDD.iterator(RDD.scala:373) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60) at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406) at org.apache.spark.rdd.RDD.iterator(RDD.scala:373) at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:88) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:87) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:39) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201) at org.apache.spark.scheduler.Task.doRunTask(Task.scala:186) at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:151) at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45) at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104) at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109) at scala.util.Using$.resource(Using.scala:269) at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108) at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:145) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:961) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:107) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:964) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:856) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750)
 
From the exception stack, it seems that somewhere, between the Spark executor and the com.microsoft.sqlserver.jdbc driver, the YEAR and MONTH functions are converted to the SQL EXECUTE function, which is not supported by SQL Server.
 
Is this a known bug? I'm experiencing it since at least two years.
Any workarounds?
 
Thanks
 
1 REPLY 1

szymon_dybczak
Esteemed Contributor III

Hi @Alessio_F ,

This happens because in Databricks SQL both year and month functions are just aliases over following patterns:

- extract (YEAR FROM expr)

- extract(MONTH FROM expr)

szymon_dybczak_1-1779360372927.png

 

szymon_dybczak_0-1779360293676.png

When Databricks pushes a predicate or expression down to the remote SQL Server via JDBC, it translates these functions back to their canonical EXTRACT(...) form - which is not a recognized in SQL Server's T-SQL dialect.

You can try to rewrite your query or you can try to use remote_query() function which lets you run SQL queries directly against external database using native SQL syntax of th remote system.

Query external databases using the remote_query function | Databricks on Google Cloud

 

If my answer was helpful, please conisder marking it as accepted solution.