11-18-2019 12:59 PM
I am trying to create a JAR for an Azure Databricks job, but some code that works in the notebook interface does not work when the library is called through a job. The strange part is that the job completes its first run successfully but fails on any subsequent run. I have to restart my cluster to get it to run again, and then it fails once more on the second run.
I have created a view on a DataFrame:
val df = spark.read.parquet(path)
df.createOrReplaceTempView("table1")
However, when I go to query the view with an aggregate function it yields an error:
val get_max_id_array = spark.sql("SELECT MAX(%s) FROM table1".format(get_id_column_array(0))).first()
Error:
ERROR Uncaught throwable from user code: org.apache.spark.sql.AnalysisException: Undefined function: 'MAX'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
11-18-2019 11:15 PM
Hi @Tyler Tamasauckas,
Please try max(df("column_name")) instead, and have a look at the blog post below regarding the max function:
https://www.programcreek.com/scala/org.apache.spark.sql.functions.max
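For comparison, here is a minimal sketch of the DataFrame-API alternative (it assumes path and get_id_column_array are defined as in your original post):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().getOrCreate()
val df = spark.read.parquet(path)

// Aggregate through the DataFrame API instead of SQL text, so no catalog lookup of "MAX" is involved.
val maxIdRow = df.agg(max(df(get_id_column_array(0)))).first()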
02-27-2020 03:50 AM
Hi @Tyler Tamasauckas ,
I was also facing the same issue with the SQL functions 'upper' and 'hash'.
In the JAR we have to call the SparkSession.builder().getOrCreate() or SparkContext.getOrCreate() API to get the SparkSession/SparkContext instance.
If the JAR uses the object-with-main() approach, it works fine the first time, but afterwards it somehow, strangely, loses the instance. I don't know the exact reason for that.
The workaround is to use the "object ... extends App" approach in the JAR; then it works.
The App trait approach takes about 10 seconds longer than an object with a main method, but only on the first run, and only for the first activity. This is because the App trait uses the delayed-initialization feature; this applies to all Scala applications. A sketch of that workaround is shown below.
If we still need to use the main-method approach, define the spark instance as an implicit and use that implicit wherever the instance is needed.
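A minimal sketch of the "extends App" workaround (the object name and the query are placeholders):

import org.apache.spark.sql.SparkSession

object SomeJob extends App {
  // No main() method: the body runs through the App trait's delayed initialization.
  val spark = SparkSession.builder().getOrCreate()
  spark.sql("SELECT MAX(id) FROM table1").show() // placeholder for the actual job logic
}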
e.g.
object SomeName {
  // UserDefinedMethod gets the SparkSession implicitly from the caller's scope.
  def UserDefinedMethod(query: String)(implicit spark: SparkSession) = { spark.sql(query) }

  def main(args: Array[String]): Unit = {
    implicit val spark = SparkSession.builder().getOrCreate()
    spark… // rest of the job logic
  }
}
Note: an object that extends App has access to the command-line arguments (args) from Scala 2.9 onward.
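For illustration (the object name is hypothetical), the arguments arrive through the inherited args field:

object SomeJob extends App {
  // args: Array[String] is provided by the App trait from Scala 2.9 onward.
  val inputPath = args(0)
}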
05-16-2022 06:54 AM
Hi, @omprakash.scala@gmail.com
Could you please tell us more about the issue you had and its solution?
We now have a similar problem: a job fails on the second run with the exception "Undefined function: to_unix_timestamp. This function is neither a built-in/temporary function..." and the only fix is to restart the cluster. I tried changing my main class to the "object ... extends App" approach, but it still didn't work.
I searched the internet and this post is the only possible clue I found. Looking forward to your response.
Thanks,
Chen
10-12-2022 12:57 AM
I am facing a similar issue when trying to use the from_utc_timestamp function. I can call the function from a Databricks notebook, but when I use the same function inside my Java JAR and run it as a job in Databricks, it gives the error below.
AnalysisException: Undefined function: from_utc_timestamp. This function is neither a built-in/temporary function, nor a persistent function that is qualified as spark_catalog.default.from_utc_timestamp.;
08-15-2024 02:26 AM
Did you find a solution?