03-17-2022 07:55 AM
Hi everybody,
Trigger.AvailableNow was released with the Databricks 10.1 runtime, and we would like to use this new feature with Auto Loader.
We write all our data pipelines in Scala, and our projects import Spark as a provided dependency. If we switch to Spark 3.2.0 (which Databricks 10.1 is based on), our code does not compile, since Trigger.AvailableNow is not in that release (at least not in the open-source version of Spark). Looking into the GitHub repository, it seems this functionality will be released with Spark 3.3.
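For context, a minimal sketch of how our builds declare Spark (the Scala and Spark versions here are just illustrative):

// build.sbt (minimal sketch; versions are illustrative)
// Spark is marked Provided: the Databricks cluster supplies it at run time,
// but compilation is pinned to the open-source 3.2.0 API, which lacks AvailableNow.
ThisBuild / scalaVersion := "2.12.15"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.0" % Provided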
Do we have to wait until the spark 3.3 release?
03-17-2022 11:53 AM
You can switch to Python. Depending on what you're doing, and whether you're using UDFs, there shouldn't be any difference at all in terms of performance.
03-17-2022 11:56 AM
Also, it does look like it's available in Scala in 10.1, according to the release notes.
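For instance, a minimal sketch of what should compile directly in a Scala notebook on DBR 10.1 (df, the checkpoint path, and the output path are placeholders):

import org.apache.spark.sql.streaming.Trigger

// On DBR 10.1+ the method is on the classpath, so no reflection is needed.
df.writeStream
  .format("delta")
  .trigger(Trigger.AvailableNow())
  .option("checkpointLocation", "/tmp/checkpoints/example") // placeholder path
  .start("/tmp/delta/example")                              // placeholder path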
03-17-2022 01:10 PM
Yes, it's available in Scala if I use a Scala notebook. But what if I develop my code in an IDE and deploy it to Databricks using CD pipelines? Is there any chance of getting the Databricks runtime packaged as a jar, so that I can use it as an sbt dependency?
03-17-2022 01:25 PM
Many things don't work in an IDE, such as dbutils and some Delta Lake features.
We don't release the source code as jars because, if we did, AWS would package it and sell it.
03-17-2022 01:36 PM
That's fair.
Anyway, this feature is basically backported from Spark 3.3.0, but since Spark 3.3.0 has not been released yet, I cannot use it: my code won't compile, so my whole development process breaks.
In the meantime I've found an ugly hack (using reflection) that allows me to avoid this issue:
import org.apache.spark.sql.streaming.Trigger

// Resolve Trigger.AvailableNow reflectively, so the code still compiles
// against open-source Spark 3.2.0 where the method does not exist yet.
val clazz = Class.forName("org.apache.spark.sql.streaming.Trigger")
val method = clazz.getMethod("AvailableNow")
// AvailableNow is a static method, hence the null receiver.
val trigger = method.invoke(null).asInstanceOf[Trigger]

val streamWriter = df.writeStream
  .format("delta")
  .options(config.sparkWriteOptions)
  .trigger(trigger)
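A slightly safer variant of the same hack (the Trigger.Once fallback here is just an illustrative choice on my part) degrades gracefully when AvailableNow is not on the classpath, e.g. in unit tests against open-source Spark 3.2.0:

import scala.util.Try
import org.apache.spark.sql.streaming.Trigger

// Try the reflective lookup; if AvailableNow is missing (open-source 3.2.0),
// fall back to Trigger.Once(), which processes all available data in one batch.
val availableNowOrOnce: Trigger =
  Try(classOf[Trigger].getMethod("AvailableNow").invoke(null).asInstanceOf[Trigger])
    .getOrElse(Trigger.Once())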
Anyway, I guess this is something that needs to be addressed somehow; in the future there may be other backported features where this kind of workaround won't work.