a month ago - last edited a month ago
Hi community,
I'm exploring ways to perform low-level, programmatic operations on Delta tables directly from a PySpark environment.
The standard delta.tables.DeltaTable Python API is excellent for high-level DML, but it seems to abstract away the core transactional engine. My goal is to interact with this engine directly.
My research suggests that it might be possible to access the underlying Scala/Java APIs by using the spark._jvm gateway. I'd like to ask for community guidance on whether this is the correct approach to access the internal org.apache.spark.sql.delta.DeltaLog and OptimisticTransaction objects.
Specifically, if spark._jvm is indeed the right path:
I'm essentially looking for the best practices and potential pitfalls when bridging between the PySpark API and the core JVM engine for transactional control.
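For context, this is roughly the kind of call I have in mind (just a sketch with a placeholder path; I have not verified the exact method signatures):

# Rough sketch only: obtain the JVM-side DeltaLog for a table path and start
# a transaction on it. The path below is a placeholder.
DeltaLog = spark._jvm.org.apache.spark.sql.delta.DeltaLog
delta_log = DeltaLog.forTable(spark._jsparkSession, "/path/to/delta/table")
txn = delta_log.startTransaction()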
Any advice or confirmation would be highly appreciated.
Thank you!
a month ago
For your consideration:

This comes down to working directly with the DeltaLog and OptimisticTransaction objects. Below are the detailed steps derived from the available guidance and best practices.

DeltaLog Instances for Unity Catalog Tables

1. Access DeltaLog with spark._jvm: through the spark._jvm gateway provided by PySpark, you can directly interact with org.apache.spark.sql.delta.DeltaLog. This involves:
- Preparing the API invocation through the JVM gateway provided by PySpark.
- Passing an absolute path or a qualified table name (with the catalog and schema from Unity Catalog) to create the DeltaLog. Internally, DeltaLog.forTable() can be used either through SQL routes or through direct filesystem path methods (see the sketch after this list).
2. Use fully qualified names (catalog.schema.table) in Unity Catalog to ensure consistent naming and governance.
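As an illustration only, here is a minimal sketch of that flow from PySpark. The catalog, schema, and table names are placeholders, and how the current Snapshot is exposed differs between Delta releases, so verify against the version you run:

# Illustration only: resolve a Unity Catalog table to its storage location,
# then obtain a DeltaLog for that path via the JVM gateway.
location = (
    spark.sql("DESCRIBE DETAIL main.my_schema.my_table")  # placeholder name
         .select("location")
         .first()[0]
)
DeltaLog = spark._jvm.org.apache.spark.sql.delta.DeltaLog
delta_log = DeltaLog.forTable(spark._jsparkSession, location)
print("Delta log path:", delta_log.logPath().toString())
# How the current Snapshot is retrieved (snapshot() / update() /
# unsafeVolatileSnapshot()) varies by Delta release, so check the DeltaLog
# source for your runtime before relying on any of them.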
OptimisticTransaction Objects

1. Using DeltaLog's startTransaction: once the DeltaLog object representing your desired table is accessed, you may start an OptimisticTransaction. This transaction object governs ACID transactional control and ensures integrity.
- Call the startTransaction() method on an initialized DeltaLog object.
- Perform modifications on the logical data state using this API before committing the data writes (a sketch follows below).
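A minimal sketch of that step, assuming the zero-argument startTransaction() overload is available in your Delta release (newer releases may require explicit arguments, so treat this as illustrative):

# Illustration only: start an optimistic transaction on an existing DeltaLog.
txn = delta_log.startTransaction()
print("Transaction reads table version:", txn.readVersion())
# Actual writes would go through txn.commit(actions, operation); building the
# action objects requires further JVM plumbing and is omitted here.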
A few caveats when going this route:
- DeltaLog's core internals (which are Scala-heavy) may evolve independently from the public Python behaviors to maintain legacy or experimental modes, so integrate carefully.
- Call DeltaLog.clearCache() when encountering discrepancies in reads or when you need the ability to refresh manually on Spark clusters (see the short sketch below).
- Check the snapshot versions exposed by the DeltaLog API to ensure that no accidental overwrites occur without proper checks (for example, when using mergeSchema).
- Avoid creating _delta_log folders at the root of a non-Delta-table hierarchy.
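As a rough illustration of the cache point (the path is a placeholder, and whether clearCache() is reachable this way can depend on the Delta build, so treat it as a sketch):

# Illustration only: drop cached DeltaLog instances so the next forTable()
# re-reads the _delta_log state, then rebuild the DeltaLog from the path.
spark._jvm.org.apache.spark.sql.delta.DeltaLog.clearCache()
DeltaLog = spark._jvm.org.apache.spark.sql.delta.DeltaLog
delta_log = DeltaLog.forTable(spark._jsparkSession, "/path/to/delta/table")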
a month ago
Hi Lou,
Thank you so much for your detailed and insightful response. It really helped clarify the intended architecture and the different APIs (DeltaLog vs. DeltaTable).
I'm trying to programmatically access the low-level Delta Lake APIs by going through the Java gateway (spark._jvm).
I have run into a persistent issue and, after extensive debugging, I believe it might be specific to this new runtime environment. I would appreciate any insight the community could offer.
Here is a summary of my investigation so far:
Initial Goal: My objective is to get a DeltaLog object in PySpark by calling the underlying JVM class: spark._jvm.org.apache.spark.sql.delta.DeltaLog
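For reference, this is essentially the call I am making (sketched with a placeholder path):

# Minimal reproduction of the attempt; the path below is a placeholder.
DeltaLog = spark._jvm.org.apache.spark.sql.delta.DeltaLog
delta_log = DeltaLog.forTable(spark._jsparkSession, "/path/to/delta/table")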
The Core Problem: Every attempt to reference this class fails with the following clear error message:
Debugging Steps Taken:
Has the package or class name for DeltaLog been changed or refactored in Delta 4.0 in a way that is not yet documented?
Thank you.