
Use the DeltaLog class in a Databricks cluster

pokus
New Contributor III

I need to use the DeltaLog class in my code to get the AddFile dataset. I have to keep the implemented code in a repo and run it on a Databricks cluster.

Some docs say to use the org.apache.spark.sql.delta.DeltaLog class, but Databricks seems to remove it at runtime, and I get NoClassDefFoundError: org/apache/spark/sql/delta/DeltaLog$ when running this on a cluster:

// Fails on a Databricks cluster with NoClassDefFoundError
val files = org.apache.spark.sql.delta.DeltaLog.forTable(spark, path(db, table))
  .unsafeVolatileSnapshot
  .allFiles

This happens when I mark the "io.delta" %% "delta-core" dependency as Provided. When I try running without Provided, I get IllegalArgumentException: requirement failed: Config entry spark.databricks.delta.timeTravel.resolveOnIdentifier.enabled already registered!
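
For context, the dependency is declared in build.sbt along these lines (the version shown is illustrative; only the Provided scope matters here):

// build.sbt: delta-core is Provided so the cluster's own Delta classes are used at runtime
libraryDependencies += "io.delta" %% "delta-core" % "2.4.0" % Provided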

The Databricks KB article https://kb.databricks.com/en_US/sql/find-size-of-table says to use com.databricks.sql.transaction.tahoe.DeltaLog, but that class lives outside the io.delta package, which causes the compilation issue: I can't even find a jar containing com.databricks.sql.transaction.tahoe.DeltaLog to add to my build. This code works on the cluster:

import io.delta.tables.DeltaTable

// Works at runtime, but the cast cannot be compiled locally because no
// published jar provides com.databricks.sql.transaction.tahoe.DeltaLog.
val deltaTable = DeltaTable.forPath(spark, path)
deltaTable.getClass.getMethod("deltaLog").invoke(deltaTable)
  .asInstanceOf[com.databricks.sql.transaction.tahoe.DeltaLog]
  .snapshot
  .allFiles

but, as I said, I can't keep it in my code because of the compilation issue.

How can I use DeltaLog in my code and still be able to run that code on a cluster?

1 ACCEPTED SOLUTION

pokus
New Contributor III

I was able to resolve the issue using reflection only:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Reflection avoids compiling against the Databricks-internal DeltaLog class.
val deltaTable = DeltaTable.forPath(spark, path(db, table))
val deltaLog = deltaTable.getClass.getMethod("deltaLog").invoke(deltaTable)
val snapshot = deltaLog.getClass.getMethod("unsafeVolatileSnapshot").invoke(deltaLog)
val allFiles = snapshot.getClass.getMethod("allFiles").invoke(snapshot).asInstanceOf[DataFrame]

but it would be good to resolve the dependency issue and be able to get the DeltaLog through the Delta API.
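
For anyone who lands here looking for the table size: once you have allFiles, you can aggregate it the same way the KB article does. A minimal sketch, assuming the standard AddFile schema with a size column in bytes:

// Estimate table size by summing AddFile sizes (assumes the usual AddFile schema)
val totalBytes = allFiles.agg(org.apache.spark.sql.functions.sum("size")).first.getLong(0)
println(s"$totalBytes bytes across ${allFiles.count} files")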


2 REPLIES


dbal
New Contributor III

Thanks for providing a solution, @pokus.

What I don't understand is why Databricks cannot provide the DeltaLog at runtime. How can this be the official solution? We need a better answer than depending on reflection.
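
Until Databricks exposes this properly, the reflection can at least be isolated behind one small helper so the rest of the codebase stays typed. A minimal sketch, assuming a Databricks runtime where the deltaLog, unsafeVolatileSnapshot, and allFiles methods resolve as reported above (DeltaLogReflection is a hypothetical name):

import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper keeping all reflection in one place
object DeltaLogReflection {
  private def call(target: AnyRef, method: String): AnyRef =
    target.getClass.getMethod(method).invoke(target)

  def allFiles(spark: SparkSession, path: String): DataFrame = {
    val table = DeltaTable.forPath(spark, path)
    val snapshot = call(call(table, "deltaLog"), "unsafeVolatileSnapshot")
    call(snapshot, "allFiles").asInstanceOf[DataFrame]
  }
}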
