03-21-2023 02:23 AM
I need to use the DeltaLog class in my code to get the AddFiles dataset. The implementation has to live in a repo and run on a Databricks cluster.
Some docs say to use the org.apache.spark.sql.delta.DeltaLog class, but Databricks appears to strip it out at runtime: I get NoClassDefFoundError: org/apache/spark/sql/delta/DeltaLog$ when running the following on a cluster:
val files = org.apache.spark.sql.delta.DeltaLog.forTable(spark, path(db, table))
  .unsafeVolatileSnapshot
  .allFiles
I use the Provided configuration for the "io.delta" %% "delta-core" dependency; when I try running without Provided, I get: IllegalArgumentException: requirement failed: Config entry spark.databricks.delta.timeTravel.resolveOnIdentifier.enabled already registered!
A Databricks KB article (https://kb.databricks.com/en_US/sql/find-size-of-table) says to use com.databricks.sql.transaction.tahoe.DeltaLog, but this class is outside the io.delta package, which causes the compilation issue. I can't even locate the jar that contains com.databricks.sql.transaction.tahoe.DeltaLog to add it to my build explicitly. This code works on the cluster:
val deltaTable = DeltaTable.forPath(spark, path)
deltaTable.getClass.getMethod("deltaLog").invoke(deltaTable)
  .asInstanceOf[com.databricks.sql.transaction.tahoe.DeltaLog]
  .snapshot
  .allFiles
but, as I said, I can't keep it in my code because of the compilation issue.
How can I use DeltaLog in my code and still be able to run that code on a cluster?
03-22-2023 08:29 AM
I was able to resolve the issue using reflection only:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

val deltaTable = DeltaTable.forPath(spark, path(db, table))
val deltaLog = deltaTable.getClass.getMethod("deltaLog").invoke(deltaTable)
val snapshot = deltaLog.getClass.getMethod("unsafeVolatileSnapshot").invoke(deltaLog)
val allFiles = snapshot.getClass.getMethod("allFiles").invoke(snapshot).asInstanceOf[DataFrame]
It would still be good to resolve the dependency issue and be able to get the DeltaLog through the Delta API, though.
03-18-2024 01:02 PM
Thanks for providing a solution, @pokus.
What I don't understand is why Databricks cannot provide the DeltaLog at runtime. How can this be the official solution? We need a better answer than depending on reflection.
10-04-2025 12:41 AM
Hi @pokus,
You don't need to go through Java reflection.
You can access the DeltaLog with spark._jvm:
Unity Catalog and Delta Lake tables expose their metadata and transaction log via the JVM backend, and through spark._jvm you can interact with the DeltaLog from PySpark.
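A minimal PySpark sketch of that idea, assuming a Databricks runtime where the internal class com.databricks.sql.transaction.tahoe.DeltaLog exposes forTable, snapshot, and allFiles (as in the KB article earlier in the thread) — these are internal, unsupported names and may change between runtime versions:

```python
# Hedged sketch (untested outside a Databricks cluster): read a Delta table's
# AddFile entries via spark._jvm, using the Databricks-internal class
# com.databricks.sql.transaction.tahoe.DeltaLog. The class and method names
# (forTable, snapshot, allFiles) are assumptions based on the KB article above.

def add_files_df(spark, table_path):
    """Return the AddFiles of the table's latest snapshot as a PySpark DataFrame."""
    from pyspark.sql import DataFrame  # imported lazily: only needed on a cluster

    jvm = spark._jvm  # py4j gateway into the driver JVM
    hadoop_path = jvm.org.apache.hadoop.fs.Path(table_path)
    delta_log = jvm.com.databricks.sql.transaction.tahoe.DeltaLog.forTable(
        spark._jsparkSession, hadoop_path
    )
    j_all_files = delta_log.snapshot().allFiles()  # Java Dataset[AddFile]
    return DataFrame(j_all_files.toDF(), spark)    # wrap the Java DataFrame
```

Because the lookup happens through py4j at runtime, nothing in this code references the internal class at compile/build time, which sidesteps the dependency problem the original post describes.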
Thanks!