Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Accessing DeltaLog and OptimisticTransaction from PySpark

Nasd_
New Contributor II

Hi community,

I'm exploring ways to perform low-level, programmatic operations on Delta tables directly from a PySpark environment.

The standard delta.tables.DeltaTable Python API is excellent for high-level DML, but it seems to abstract away the core transactional engine. My goal is to interact with this engine directly.

My research suggests that it might be possible to access the underlying Scala/Java APIs by using the spark._jvm gateway. I'd like to ask for community guidance on whether this is the correct approach to access the internal org.apache.spark.sql.delta.DeltaLog and OptimisticTransaction objects.

Specifically, if spark._jvm is indeed the right path:

  1. What is the canonical way to obtain a DeltaLog instance for a given table, especially considering tables in Unity Catalog?
  2. Once a DeltaLog object is obtained, is calling its startTransaction() method the standard way to get an OptimisticTransaction?
  3. Is the DeltaLog API intentionally kept separate from the high-level DeltaTable wrapper to maintain a stable public API?

I'm essentially looking for the best practices and potential pitfalls when bridging between the PySpark API and the core JVM engine for transactional control.

Any advice or confirmation would be highly appreciated.

Thank you!

1 ACCEPTED SOLUTION


BigRoux
Databricks Employee

For your consideration:

 

To interact programmatically with Delta Tables in Unity Catalog via the lower-level transactional APIs, the primary focus is on accessing DeltaLog and OptimisticTransaction objects. Below are the detailed steps derived from the available guidance and best practices:
Obtaining DeltaLog Instances for Unity Catalog Tables

  1. Access DeltaLog with spark._jvm: Unity Catalog and Delta Lake tables expose their metadata and transaction log through the JVM backend. Using spark._jvm, you can interact directly with org.apache.spark.sql.delta.DeltaLog. This involves:
    • Invoking the API through the JVM gateway that PySpark provides.
    • Passing an absolute path or a fully qualified table name (catalog and schema from Unity Catalog) to create the DeltaLog. Internally, DeltaLog.forTable() can be used, either via SQL resolution of the table or via a direct filesystem path (see the sketch after this list).
  2. Best Practice for Namespaces: Work with Unity Catalog-enabled configurations, where data governance, permissions, and lineage are integrated natively with catalogs. Always use fully qualified identifiers (e.g., catalog.schema.table) in Unity Catalog to ensure consistent naming and governance.
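A minimal sketch of the path-based route from PySpark, assuming a runtime where the org.apache.spark.sql.delta.DeltaLog class is loadable from the driver; the catalog, schema, and table names are placeholders:

    # Resolve the Unity Catalog table to its storage location first.
    table_path = (
        spark.sql("DESCRIBE DETAIL main.my_schema.my_table")
             .select("location")
             .first()[0]
    )

    # DeltaLog.forTable(SparkSession, String) is the path-based overload;
    # it needs the Java SparkSession, not the Python wrapper.
    delta_log = spark._jvm.org.apache.spark.sql.delta.DeltaLog.forTable(
        spark._jsparkSession, table_path
    )

    # Sanity check: latest snapshot version of the table.
    print(delta_log.update().version())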
Initiating OptimisticTransaction Objects

  1. Using DeltaLog's startTransaction(): Once you have the DeltaLog object for the desired table, you can start an OptimisticTransaction. This transaction object governs ACID transactional control and ensures integrity (see the sketch after this list).
    • Call the startTransaction() method on an initialized DeltaLog object.
    • Perform modifications on the logical data state through this API before committing any data writes.
  2. Best Practices for Transactions:
    • Always encapsulate write operations in explicit transaction boundaries.
    • If multiple concurrent transactions are expected, ensure proper isolation using Delta's built-in conflict resolution mechanisms. Monitor log checkpoints, and use manually controlled file locking if transactional barriers need to extend across long spans.
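A minimal sketch, continuing from the delta_log handle above and assuming the no-argument startTransaction() overload is available in your Delta version; it only inspects the transaction, since building the Scala Action objects required by commit() from Python is beyond a short example:

    # Start an OptimisticTransaction from the DeltaLog handle obtained above.
    txn = delta_log.startTransaction()

    # The transaction pins the table version it read; conflicting commits made
    # after this point are detected at commit time.
    print(txn.readVersion())

    # Table metadata (schema, partition columns) as seen by this transaction.
    print(txn.metadata().schemaString())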
Maintaining Stability and API Abstraction Separation

  1. Intentional API Separation: Although DeltaLog offers raw control over transactional state, the DeltaTable API intentionally encapsulates these operations behind high-level functions to maintain consistency and backward compatibility for most users. Working directly with DeltaLog is reserved for inherently low-level, fine-grained operations (such as metadata migration or resolving concurrent mutations).
  2. Backward Compatibility Concerns for DeltaLog: DeltaLog's internals (Scala-heavy code) may evolve independently of the public Python behavior to support legacy or experimental modes. To integrate carefully:
    • Follow deprecation notices strictly (typically published on the Databricks community channels).
Pitfalls and Best Practices to Avoid Common Issues

  1. Caching Issues: Improper caching can lead to stale data or metadata mismatches. Use Delta APIs such as DeltaLog.clearCache() when you encounter discrepancies in reads or need manual control over refreshes in Spark clusters (see the sketch after this list).
  2. Conflict Resolution in Transactions: Handle concurrent operations carefully. Using the snapshot versions exposed by the DeltaLog API ensures that no accidental overwrites occur without proper checks (e.g., mergeSchema).
  3. System Table Compliance: Unity Catalog expects you to respect the permission rules set in Data Explorer and Delta-specific log layout conventions. Avoid creating _delta_log folders anywhere other than the root of a Delta table.
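A minimal sketch of the cache-clearing call through the same gateway, assuming the open-source class name is on the classpath:

    # Clear Delta's internal DeltaLog cache so the next access re-reads the log.
    spark._jvm.org.apache.spark.sql.delta.DeltaLog.clearCache()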
Taken together, these approaches should give you clarity and precision when taking transactional control and extending PySpark down to the core Delta table engine.
 
Hope this helps, Lou.


2 REPLIES


Nasd_
New Contributor II

Hi Lou,

Thank you so much for your detailed and insightful response. It really helped clarify the intended architecture and the different APIs (DeltaLog vs. DeltaTable).

I'm trying to programmatically access the low-level Delta Lake APIs by going through the Java gateway (spark._jvm).

I have run into a persistent issue and, after extensive debugging, I believe it might be specific to this new runtime environment. I would appreciate any insight the community could offer.

Here is a summary of my investigation so far:

  1. Initial Goal: My objective is to get a DeltaLog object in PySpark by calling the underlying JVM class: spark._jvm.org.apache.spark.sql.delta.DeltaLog

  2. The Core Problem: Every attempt to reference this class fails with the following clear error message:

     Py4JError: org.apache.spark.sql.delta.DeltaLog does not exist in the JVM
  3. Debugging Steps Taken:

    • Explicit Maven Install: Since the error indicates the class is not on the classpath, I tried to force its installation directly onto the cluster. I went to the cluster's "Libraries" tab and installed the correct Maven coordinates for DBR 16.4 (which uses Delta 4.0 and Scala 2.12): io.delta:delta-spark_2.13:4.0.0
    • Result: Despite the library showing as "Installed" and after multiple cluster restarts, the exact same Py4JError: ...does not exist in the JVM error persists.
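In case it helps with the diagnosis, here is a small probe I'm running to see which class names the driver JVM can actually load; the second name is only my guess that DBR may repackage the open-source Delta classes under an internal namespace, so please treat it as an assumption:

    # Probe which DeltaLog class names the driver JVM can actually load.
    candidates = [
        "org.apache.spark.sql.delta.DeltaLog",            # open-source Delta name
        "com.databricks.sql.transaction.tahoe.DeltaLog",  # guess: possible DBR-internal name
    ]

    for name in candidates:
        try:
            spark._jvm.java.lang.Class.forName(name)
            print(f"FOUND:   {name}")
        except Exception:
            print(f"MISSING: {name}")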

Has the package or class name for DeltaLog been changed or refactored in Delta 4.0 in a way that is not yet documented?

Thank you.
