I need to use the DeltaLog class in my code to get the AddFile dataset. The code has to live in a repo, be compiled there, and run on a Databricks cluster.
Some docs say to use the org.apache.spark.sql.delta.DeltaLog class, but Databricks seems to remove it at runtime: I get NoClassDefFoundError: org/apache/spark/sql/delta/DeltaLog$ when running this on the cluster:
val files = org.apache.spark.sql.delta.DeltaLog.forTable(spark, path(db, table))
  .unsafeVolatileSnapshot
  .allFiles
This happens because I mark the "io.delta" %% "delta-core" dependency as Provided. When I run without Provided, I instead get IllegalArgumentException: requirement failed: Config entry spark.databricks.delta.timeTravel.resolveOnIdentifier.enabled already registered!, presumably because the bundled open-source Delta classes clash with the ones the cluster already ships.
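For context, this is how the dependency is declared in build.sbt (a sketch; the version number is just illustrative):

// delta-core marked Provided so compilation sees the open-source classes,
// while the cluster supplies its own Delta implementation at runtime
libraryDependencies += "io.delta" %% "delta-core" % "2.1.0" % Provided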
The Databricks KB article https://kb.databricks.com/en_US/sql/find-size-of-table says to use com.databricks.sql.transaction.tahoe.DeltaLog, but that class lives outside the io.delta artifacts, which is exactly what causes my compilation issue: I can't even point my build at a jar that contains com.databricks.sql.transaction.tahoe.DeltaLog so that I could import it explicitly. The following code does work on the cluster:
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(spark, path)
// grab the underlying DeltaLog via reflection, since deltaLog is not directly accessible on DeltaTable
deltaTable.getClass.getMethod("deltaLog").invoke(deltaTable)
  .asInstanceOf[com.databricks.sql.transaction.tahoe.DeltaLog]
  .snapshot
  .allFiles
but, as I said, I can't keep this in my repo, because the cast to com.databricks.sql.transaction.tahoe.DeltaLog doesn't compile outside the cluster.
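The only compiling variant I've come up with is to go fully reflective, so the tahoe class never appears in my source at all. This is just a sketch: the method names (deltaLog, snapshot, allFiles) are simply what I observed working on the cluster, and all type safety is lost:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Walk DeltaTable -> DeltaLog -> Snapshot -> allFiles entirely via reflection,
// so com.databricks.sql.transaction.tahoe.DeltaLog is never named at compile time
def allFilesOf(spark: SparkSession, tablePath: String): DataFrame = {
  val deltaTable = DeltaTable.forPath(spark, tablePath)
  val deltaLog   = deltaTable.getClass.getMethod("deltaLog").invoke(deltaTable)
  val snapshot   = deltaLog.getClass.getMethod("snapshot").invoke(deltaLog)
  snapshot.getClass.getMethod("allFiles").invoke(snapshot)
    .asInstanceOf[Dataset[_]]
    .toDF() // view Dataset[AddFile] as a plain DataFrame
}

This compiles, but it is obviously fragile, which is why I'm looking for a cleaner approach.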
How can I use DeltaLog in my code and still be able to run that code on a Databricks cluster?