Hello,
I'm working on Databricks with a cluster running Runtime 16.4, which includes Spark 3.5.2 and Scala 2.12.
For a specific need, I want to implement my own custom way of writing to Delta tables by manually managing Delta transactions from PySpark. To do this, I want to access the Delta Lake transactional engine via the JVM embedded in the Spark session, specifically by using the class:
org.apache.spark.sql.delta.DeltaLog
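For context, this is the kind of lookup I am attempting from a notebook, using the py4j gateway exposed on the Spark session (nothing here is Databricks-specific):

```python
# Reaching for the Delta transaction log class through the JVM gateway
DeltaLog = spark._jvm.org.apache.spark.sql.delta.DeltaLog
```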
Issue
When I try to use classes from the package org.apache.spark.sql.delta directly from PySpark (through spark._jvm), the classes are not found if I don't have the Delta Core package installed explicitly on the cluster.
When I install the Delta Core Python package to gain access, I encounter the following Python import error:
ModuleNotFoundError: No module named 'delta.exceptions.captured'; 'delta.exceptions' is not a package
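For reference, this is roughly how I reproduce it; I am assuming here that the "Delta Core" Python package means delta-spark from PyPI, and the exact import that surfaces the error may differ:

```python
# Installed at notebook scope, roughly:
#   %pip install delta-spark

# Importing the Python bindings then fails with the error above, e.g.:
from delta.tables import DeltaTable
```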
Without the Delta Core package installed, accessing DeltaLog simply returns a generic JavaPackage object that is unusable.
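Concretely, checking the type of the handle confirms that py4j never binds the class and just falls back to a package placeholder:

```python
# The class name cannot be resolved on the JVM side, so py4j returns a
# generic JavaPackage rather than a callable JavaClass.
print(type(spark._jvm.org.apache.spark.sql.delta.DeltaLog))
# <class 'py4j.java_gateway.JavaPackage'>
```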
What I want to do
Access the Delta transaction log API (DeltaLog) from PySpark via the JVM.
Be able to start transactions and commit manually to implement custom write behavior (see the sketch after this list).
Work within the Databricks Runtime 16.4 environment without conflicts or missing dependencies.
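To make the intent concrete, here is a rough sketch of the flow I am after, written against the OSS Delta API as I understand it. The table path is a placeholder, and I realize the exact signatures and the Scala/Python conversions (e.g. building a Seq of AddFile actions) may not work as written through py4j:

```python
# Desired custom commit flow via the JVM (sketch, not working code)
jvm = spark._jvm
delta_log = jvm.org.apache.spark.sql.delta.DeltaLog.forTable(
    spark._jsparkSession, "/mnt/data/my_table"  # placeholder table path
)

txn = delta_log.startTransaction()  # OptimisticTransaction on the current snapshot
# ... write the data files myself, describe them as AddFile actions ...
# txn.commit(<Seq[Action] of AddFile entries>, <DeltaOperations.Write(...)>)
```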
Questions
How can I correctly access and use org.apache.spark.sql.delta.DeltaLog from PySpark on Databricks Runtime 16.4?
Is there a supported way to manually manage Delta transactions through the JVM in this environment?
What is the correct setup or package dependency to avoid the ModuleNotFoundError when installing the Delta Core Python package?
Are there any alternatives or recommended patterns to achieve manual Delta commits programmatically on Databricks?