I need to use the DeltaLog class in my code to get the AddFile dataset. The code has to live in a repo, be compiled there, and run on a Databricks cluster.
Some docs say to use the org.apache.spark.sql.delta.DeltaLog class, but Databricks seems to remove it at runtime: I get NoClassDefFoundError: org/apache/spark/sql/delta/DeltaLog$ when running this on the cluster:
val files = org.apache.spark.sql.delta.DeltaLog.forTable(spark, path(db, table))
  .unsafeVolatileSnapshot
  .allFiles
This happens because I mark the "io.delta" %% "delta-core" dependency as Provided. When I run without Provided, I instead get IllegalArgumentException: requirement failed: Config entry spark.databricks.delta.timeTravel.resolveOnIdentifier.enabled already registered!, presumably because the bundled open-source Delta classes clash with the ones the cluster already ships.
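For context, this is how the dependency is declared in build.sbt (a sketch; the version number is just illustrative):

// delta-core marked Provided so compilation sees the open-source classes,
// while the cluster supplies its own Delta implementation at runtime
libraryDependencies += "io.delta" %% "delta-core" % "2.1.0" % Provided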
The Databricks KB article https://kb.databricks.com/en_US/sql/find-size-of-table says to use com.databricks.sql.transaction.tahoe.DeltaLog, but that class lives outside the io.delta artifacts, which is exactly what causes my compilation issue: I can't even point my build at a jar that contains com.databricks.sql.transaction.tahoe.DeltaLog so that I could import it explicitly. The following code does work on the cluster:
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(spark, path)
// grab the underlying DeltaLog via reflection, since deltaLog is not directly accessible on DeltaTable
deltaTable.getClass.getMethod("deltaLog").invoke(deltaTable)
  .asInstanceOf[com.databricks.sql.transaction.tahoe.DeltaLog]
  .snapshot
  .allFiles
but, as I said, I can't keep this in my repo, because the cast to com.databricks.sql.transaction.tahoe.DeltaLog doesn't compile outside the cluster.
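The only compiling variant I've come up with is to go fully reflective, so the tahoe class never appears in my source at all. This is just a sketch: the method names (deltaLog, snapshot, allFiles) are simply what I observed working on the cluster, and all type safety is lost:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Walk DeltaTable -> DeltaLog -> Snapshot -> allFiles entirely via reflection,
// so com.databricks.sql.transaction.tahoe.DeltaLog is never named at compile time
def allFilesOf(spark: SparkSession, tablePath: String): DataFrame = {
  val deltaTable = DeltaTable.forPath(spark, tablePath)
  val deltaLog   = deltaTable.getClass.getMethod("deltaLog").invoke(deltaTable)
  val snapshot   = deltaLog.getClass.getMethod("snapshot").invoke(deltaLog)
  snapshot.getClass.getMethod("allFiles").invoke(snapshot)
    .asInstanceOf[Dataset[_]]
    .toDF() // view Dataset[AddFile] as a plain DataFrame
}

This compiles, but it is obviously fragile, which is why I'm looking for a cleaner approach.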
How can I use DeltaLog in my code and still be able to run that code on a Databricks cluster?