<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Accessing DeltaLog and OptimisticTransaction from PySpark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/accessing-deltalog-and-optimistictransaction-from-pyspark/m-p/121909#M46599</link>
    <description>Topic: Accessing DeltaLog and OptimisticTransaction from PySpark in Data Engineering.</description>
    <pubDate>Mon, 16 Jun 2025 21:09:34 GMT</pubDate>
    <dc:creator>Louis_Frolio</dc:creator>
    <dc:date>2025-06-16T21:09:34Z</dc:date>
    <item>
      <title>Accessing DeltaLog and OptimisticTransaction from PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/accessing-deltalog-and-optimistictransaction-from-pyspark/m-p/121886#M46588</link>
      <description>&lt;P&gt;Hi community,&lt;/P&gt;&lt;P&gt;I'm exploring ways to perform low-level, programmatic operations on Delta tables directly from a PySpark environment.&lt;/P&gt;&lt;P&gt;The standard &lt;STRONG&gt;delta.tables.DeltaTable&lt;/STRONG&gt; Python API is excellent for high-level DML, but it seems to abstract away the core transactional engine. My goal is to interact with this engine directly.&lt;/P&gt;&lt;P&gt;My research suggests that it might be possible to access the underlying Scala/Java APIs by using the &lt;STRONG&gt;spark._jvm&lt;/STRONG&gt; gateway. I'd like to ask for community guidance on whether this is the correct approach to access the internal &lt;STRONG&gt;org.apache.spark.sql.delta.DeltaLog&lt;/STRONG&gt; and &lt;STRONG&gt;OptimisticTransaction&lt;/STRONG&gt; objects.&lt;/P&gt;&lt;P&gt;Specifically, if &lt;STRONG&gt;spark._jvm&lt;/STRONG&gt; is indeed the right path:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;What is the canonical way to obtain a &lt;STRONG&gt;DeltaLog&lt;/STRONG&gt; instance for a given table, especially considering tables in Unity Catalog?&lt;/LI&gt;&lt;LI&gt;Once a &lt;STRONG&gt;DeltaLog&lt;/STRONG&gt; object is obtained, is calling its &lt;STRONG&gt;startTransaction()&lt;/STRONG&gt; method the standard way to get an OptimisticTransaction?&lt;/LI&gt;&lt;LI&gt;Is the &lt;STRONG&gt;DeltaLog&lt;/STRONG&gt; API intentionally kept separate from the high-level &lt;STRONG&gt;DeltaTable&lt;/STRONG&gt; wrapper to maintain a stable public API?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;I'm essentially looking for the best practices and potential pitfalls when bridging between the PySpark API and the core JVM engine for transactional control.&lt;/P&gt;&lt;P&gt;Any advice or confirmation would be highly appreciated.&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Mon, 16 Jun 2025 15:29:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/accessing-deltalog-and-optimistictransaction-from-pyspark/m-p/121886#M46588</guid>
      <dc:creator>Nasd_</dc:creator>
      <dc:date>2025-06-16T15:29:59Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing DeltaLog and OptimisticTransaction from PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/accessing-deltalog-and-optimistictransaction-from-pyspark/m-p/121909#M46599</link>
      <description>&lt;P&gt;For your consideration:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="paragraph"&gt;To interact programmatically with Delta Tables in Unity Catalog via the lower-level transactional APIs, the primary focus is on accessing &lt;CODE&gt;DeltaLog&lt;/CODE&gt; and &lt;CODE&gt;OptimisticTransaction&lt;/CODE&gt; objects. Below are the detailed steps derived from the available guidance and best practices:&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Obtaining &lt;CODE&gt;DeltaLog&lt;/CODE&gt; Instances for Unity Catalog Tables 1. &lt;STRONG&gt;Access DeltaLog with &lt;CODE&gt;spark._jvm&lt;/CODE&gt;:&lt;/STRONG&gt;&lt;BR /&gt;Unity Catalog and DeltaLake tables expose their metadata and transaction log via the JVM backend. Using &lt;CODE&gt;spark._jvm&lt;/CODE&gt;, you can directly interact with &lt;CODE&gt;org.apache.spark.sql.delta.DeltaLog&lt;/CODE&gt;. This involves: - Preparing the API invocation through &lt;CODE&gt;JVM API&lt;/CODE&gt; gateway provided by PySpark. - Passing absolute paths or qualified table names (with catalogs and schemas from Unity Catalog) to create &lt;CODE&gt;DeltaLog&lt;/CODE&gt;. Internally, &lt;CODE&gt;DeltaLog.forTable()&lt;/CODE&gt; can be utilized for interaction through SQL routes or direct filesystem path methods.&lt;/DIV&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;&lt;STRONG&gt;Best Practice for Namespaces:&lt;/STRONG&gt; It’s advised to work in Unity Catalog-enabled environments, where data governance, permissions, and lineage are integrated natively with catalogs. Always use fully qualified identifiers (e.g., &lt;CODE&gt;catalog.schema.table&lt;/CODE&gt;) in Unity Catalog to ensure consistent naming and governance.&lt;/LI&gt;
&lt;/OL&gt;
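&lt;DIV class="paragraph"&gt;A minimal sketch of the path-based route, assuming an environment where the open-source class name is on the driver classpath; the table path is hypothetical, and on Databricks Runtime the same class is packaged under a different name (see the last reply in this thread):&lt;/DIV&gt;
&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Root of the Delta table (the directory containing _delta_log); hypothetical path.
table_path = "/tmp/delta/my_table"

# DeltaLog.forTable is a Scala companion-object method; Py4J lets us call it
# through the gateway, passing the underlying Java SparkSession.
DeltaLog = spark._jvm.org.apache.spark.sql.delta.DeltaLog
delta_log = DeltaLog.forTable(spark._jsparkSession, table_path)

# Simple sanity checks against the log object (exact method surface varies by
# Delta Lake / Databricks Runtime version).
print(delta_log.dataPath().toString())
print(delta_log.tableExists())
&lt;/CODE&gt;&lt;/PRE&gt;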
&lt;DIV class="paragraph"&gt;Initiating &lt;CODE&gt;OptimisticTransaction&lt;/CODE&gt; Objects 1. &lt;STRONG&gt;Using DeltaLog's &lt;CODE&gt;startTransaction&lt;/CODE&gt;:&lt;/STRONG&gt;&lt;BR /&gt;Once a &lt;CODE&gt;DeltaLog&lt;/CODE&gt; object representing your desired table is accessed, you may start an &lt;CODE&gt;OptimisticTransaction&lt;/CODE&gt;. This transaction object governs ACID transactional control and ensures integrity. - Call &lt;CODE&gt;startTransaction()&lt;/CODE&gt; method on an initialized &lt;CODE&gt;DeltaLog&lt;/CODE&gt; object. - Perform modifications on logical data states using this API before committing data writesa.&lt;/DIV&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;&lt;STRONG&gt;Best Practices for Transactions:&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;Always encapsulate write operations in explicit transaction boundaries.&lt;/LI&gt;
&lt;LI&gt;If multiple concurrent transactions are expected, rely on Delta’s built-in optimistic conflict detection: monitor log checkpoints and be prepared to retry on conflict exceptions, rather than introducing manual file locking, which Delta’s protocol does not use.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
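&lt;DIV class="paragraph"&gt;A minimal sketch, continuing from the &lt;CODE&gt;delta_log&lt;/CODE&gt; handle obtained above; these are internal Scala methods, so names and required arguments can differ between Delta Lake and Databricks Runtime versions:&lt;/DIV&gt;
&lt;PRE&gt;&lt;CODE&gt;# Start a transaction against the current table state.
# Note: in some Delta versions startTransaction takes Scala Option arguments with
# defaults, and Scala default arguments are not visible through Py4J, so a
# zero-argument call may need to be adapted for your runtime.
txn = delta_log.startTransaction()

# The table version the transaction read; Delta's optimistic concurrency check
# compares this against the table state at commit time to detect conflicting writers.
print(txn.readVersion())

# Committing from Python is the hard part: OptimisticTransaction.commit expects a
# Scala Seq of Action objects plus an operation descriptor, both of which would
# have to be constructed through the Py4J gateway.
&lt;/CODE&gt;&lt;/PRE&gt;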
&lt;DIV class="paragraph"&gt;Maintaining Stability and API Abstraction Separation 1. &lt;STRONG&gt;API Intentional Separation:&lt;/STRONG&gt; Although DeltaLog directly offers raw control over transactional states, DeltaTable API intentionally encapsulates these operations under high-level functions to maintain consistency and backward compatibility for most users. Engaging directly with DeltaLog is reserved for inherently low-level, fine-grained system operations (such as metadata migration or concurrent mutation resolution).&lt;/DIV&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;&lt;STRONG&gt;Backwards Compatibility Concerns for &lt;CODE&gt;DeltaLog&lt;/CODE&gt;:&lt;/STRONG&gt; DeltaLog's internals are Scala code that is not part of the stable public API and may evolve independently of the public Python surface between releases. To integrate carefully (a small class-name probe sketch follows this list):
&lt;UL&gt;
&lt;LI&gt;Follow ongoing deprecation notices strictly (typically published on the Databricks community channels).&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
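&lt;DIV class="paragraph"&gt;As a small defensive sketch, you can probe which &lt;CODE&gt;DeltaLog&lt;/CODE&gt; class is actually present before relying on it; the candidate class names below come from later replies in this thread, and everything else is illustrative:&lt;/DIV&gt;
&lt;PRE&gt;&lt;CODE&gt;def find_delta_log_class(spark):
    """Return the first DeltaLog class name reachable in the driver JVM, else None."""
    candidates = [
        "com.databricks.sql.transaction.tahoe.DeltaLog",  # Databricks Runtime packaging
        "org.apache.spark.sql.delta.DeltaLog",            # open-source delta-spark jar
    ]
    for name in candidates:
        try:
            # Class.forName raises a JVM ClassNotFoundException (surfaced by Py4J)
            # when the class is not on the driver classpath.
            spark._jvm.java.lang.Class.forName(name)
            return name
        except Exception:
            continue
    return None
&lt;/CODE&gt;&lt;/PRE&gt;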
&lt;DIV class="paragraph"&gt;Pitfalls and Best Practices to Avoid Common Issues 1. &lt;STRONG&gt;Caching Issues:&lt;/STRONG&gt; Improper caching can lead to stale data or metadata mismatches. Use Delta APIs like &lt;CODE&gt;DeltaLog.clearCache()&lt;/CODE&gt; when encountering discrepancies in reads or when retaining manual refreshing ability in Spark clusters.&lt;/DIV&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Conflict Resolution in Transactions:&lt;/STRONG&gt; Handle concurrent operations carefully. Using &lt;CODE&gt;snapshot&lt;/CODE&gt; versions exposed by the DeltaLog API ensures that no accidental overwrites occur without proper checks using &lt;CODE&gt;mergeSchema&lt;/CODE&gt;.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;System Table Compliance:&lt;/STRONG&gt; Unity Catalog suggests adhering to permission rules set within Data Explorer or delta-specific log sharding for efficiency. Avoid creating &lt;CODE&gt;_delta_log&lt;/CODE&gt; folders at non-Delta Table hierarchy root.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/OL&gt;
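&lt;DIV class="paragraph"&gt;A minimal sketch of the cache reset, again assuming the open-source class name (substitute the Databricks package name on Databricks Runtime):&lt;/DIV&gt;
&lt;PRE&gt;&lt;CODE&gt;# Clear the driver-side DeltaLog cache so the next DeltaLog.forTable call
# re-reads the _delta_log state from storage instead of a cached snapshot.
DeltaLog = spark._jvm.org.apache.spark.sql.delta.DeltaLog
DeltaLog.clearCache()
&lt;/CODE&gt;&lt;/PRE&gt;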
&lt;DIV class="paragraph"&gt;These approaches should offer clarity and precision in achieving transactional controls and extending PySpark efficiencies against core Delta Table system backbones.&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Hope this helps, Lou.&lt;/DIV&gt;</description>
      <pubDate>Mon, 16 Jun 2025 21:09:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/accessing-deltalog-and-optimistictransaction-from-pyspark/m-p/121909#M46599</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-06-16T21:09:34Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing DeltaLog and OptimisticTransaction from PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/accessing-deltalog-and-optimistictransaction-from-pyspark/m-p/121953#M46607</link>
      <description>&lt;P&gt;Hi Lou,&lt;/P&gt;&lt;P&gt;Thank you so much for your detailed and insightful response. It really helped clarify the intended architecture and the different APIs (DeltaLog vs. DeltaTable).&lt;/P&gt;&lt;P&gt;I'm trying to programmatically access the low-level Delta Lake APIs by passing through the Java gateway (spark._jvm).&lt;/P&gt;&lt;P&gt;I have run into a persistent issue and, after extensive debugging, I believe it might be specific to this new runtime environment. I would appreciate any insight the community could offer.&lt;/P&gt;&lt;P&gt;Here is a summary of my investigation so far:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Initial Goal:&lt;/STRONG&gt; My objective is to get a DeltaLog object in PySpark by calling the underlying JVM class: spark._jvm.org.apache.spark.sql.delta.DeltaLog&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;The Core Problem:&lt;/STRONG&gt; Every attempt to reference this class fails with the following clear error message:&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;Py4JError: org.apache.spark.sql.delta.DeltaLog does not exist in the JVM&lt;/DIV&gt;&lt;/DIV&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Debugging Steps Taken:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Explicit Maven Install:&lt;/STRONG&gt; Since the error indicates the class is not on the classpath, I tried to force its installation directly onto the cluster. I went to the cluster's "Libraries" tab and installed the correct Maven coordinates for DBR 16.4 (which uses Delta 4.0 and Scala 2.12): io.delta:delta-spark_2.13:4.0.0&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Result:&lt;/STRONG&gt; Despite the library showing as "Installed" and after multiple cluster restarts, the exact same Py4JError: ...does not exist in the JVM error persists.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Has the package or class name for DeltaLog been changed or refactored in Delta 4.0 in a way that is not yet documented?&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jun 2025 08:24:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/accessing-deltalog-and-optimistictransaction-from-pyspark/m-p/121953#M46607</guid>
      <dc:creator>Nasd_</dc:creator>
      <dc:date>2025-06-17T08:24:58Z</dc:date>
    </item>
    <item>
      <title>Re: Accessing DeltaLog and OptimisticTransaction from PySpark</title>
      <link>https://community.databricks.com/t5/data-engineering/accessing-deltalog-and-optimistictransaction-from-pyspark/m-p/133777#M49923</link>
      <description>&lt;P&gt;For the Databricks pre-installed package, use&amp;nbsp;&lt;SPAN&gt;spark._jvm.com.databricks.sql.transaction.tahoe.DeltaLog&lt;/SPAN&gt;.&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&lt;STRONG&gt;org.apache.spark.sql.delta.&lt;A href="http://spark/src/main/scala/org/apache/spark/sql/delta/DeltaLog.scala" target="_self"&gt;DeltaLog&lt;/A&gt;&lt;/STRONG&gt; is the class name in the open-source jar.&lt;/SPAN&gt;&lt;/P&gt;
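&lt;P&gt;A minimal sketch of that difference, with a hypothetical table path (these are internal classes, not a stable public API):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;# Databricks Runtime: Delta internals are repackaged under the "tahoe" namespace.
delta_log = spark._jvm.com.databricks.sql.transaction.tahoe.DeltaLog.forTable(
    spark._jsparkSession, "/tmp/delta/my_table"
)

# Open-source Spark with the delta-spark jar on the classpath:
# delta_log = spark._jvm.org.apache.spark.sql.delta.DeltaLog.forTable(
#     spark._jsparkSession, "/tmp/delta/my_table"
# )
&lt;/CODE&gt;&lt;/PRE&gt;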
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 04 Oct 2025 07:28:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/accessing-deltalog-and-optimistictransaction-from-pyspark/m-p/133777#M49923</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-10-04T07:28:28Z</dc:date>
    </item>
  </channel>
</rss>

