Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

CDC Delta table select using startingVersion on Shared cluster running DBR 14.3 does not work

Jaris
New Contributor III

Hello everyone,

We have switched from DBR 13.3 to 14.3 on our Shared development cluster, and I am no longer able to run the following read from a Delta table with CDC enabled:

data = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", x)
    .option("endingVersion", x)
    .table(f"bronze.{table_name}")
    .select("GJAHR")
)

The same select works fine on a Single User cluster with DBR 14.3, on a Shared cluster with DBR 13.3, and when I use the following SQL equivalent on a Shared cluster with DBR 14.3:

SELECT GJAHR
    FROM table_changes('bronze.{table_name}', x, x)
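
For anyone who prefers to stay in a Python notebook, the SQL form above can also be issued through spark.sql. The following is a minimal sketch, assuming `spark` is the session that a Databricks notebook provides. The helper names `build_changes_query` and `read_changes` are hypothetical, and the thread only confirms that the SQL statement itself works on the Shared cluster, so treat the Python wrapper as an assumption to verify.

```python
def build_changes_query(table, start, end, columns=("GJAHR",)):
    """Build a table_changes() query equivalent to the SQL shown above.

    Hypothetical helper: table name, versions and columns are placeholders.
    """
    # Backticks make Spark SQL read the names as identifiers, not string literals.
    cols = ", ".join(f"`{c}`" for c in columns)
    return f"SELECT {cols} FROM table_changes('{table}', {int(start)}, {int(end)})"


def read_changes(spark, table, start, end, columns=("GJAHR",)):
    """Run the query on an existing SparkSession and return a DataFrame."""
    return spark.sql(build_changes_query(table, start, end, columns))
```

In a notebook this would be called as read_changes(spark, "table_name", 161, 161).count(), with the table name and version replaced by real values.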

The issue seems to be that Spark somehow cannot match the selected field to what is available in the table. If I run the code without the .select("GJAHR"), it works fine. Also, if I select only the CDC fields such as _commit_version, everything runs well. Here is an excerpt from the error message produced by the first code snippet:

AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] Resolved attribute(s) "GJAHR" missing from "RCLNT", "RLDNR", "RBUKRS", "GJAHR", ...
!Project [GJAHR#72222]. Attribute(s) with the same name appear in the operation: "GJAHR".
Please check if the right attribute(s) are used. SQLSTATE: XX000;
Aggregate [count(1) AS count(1)#72723L]
+- !Project [GJAHR#72222]
   +- ...
      +- Relation snpdwh.bronze_sap.acdoca[RCLNT#72724,RLDNR#72725,RBUKRS#72726,GJAHR#72727,...
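
Reading the excerpt closely, the projection refers to GJAHR#72222 while the relation exposes GJAHR#72727, so the same column name carries two different expression IDs. The sketch below pulls that mismatch out of the plan text automatically. It assumes attributes are printed as name#id, as in the excerpt, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Matches attributes printed as NAME#ID, e.g. GJAHR#72222.
ATTR = re.compile(r"\b([A-Za-z_]\w*)#(\d+)")


def expression_ids(plan_text):
    """Map each attribute name to the set of expression IDs seen in the text."""
    ids = defaultdict(set)
    for name, expr_id in ATTR.findall(plan_text):
        ids[name].add(int(expr_id))
    return dict(ids)


def conflicting_attributes(plan_text):
    """Keep only the names that appear with more than one expression ID."""
    return {n: sorted(v) for n, v in expression_ids(plan_text).items() if len(v) > 1}


excerpt = """
Aggregate [count(1) AS count(1)#72723L]
+- !Project [GJAHR#72222]
   +- Relation snpdwh.bronze_sap.acdoca[RCLNT#72724,RLDNR#72725,RBUKRS#72726,GJAHR#72727,...
"""
print(conflicting_attributes(excerpt))  # {'GJAHR': [72222, 72727]}
```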

DBR 14.3 is no longer in beta, so everything should work. The type of compute plays no role, apart from the access mode mentioned above. Our Databricks workspace is hosted on Azure.

Is this a bug or do you see any errors in my logic?

Thanks.

6 REPLIES

Kaniz_Fatma
Community Manager

Hi @Jaris, it appears that you've encountered an issue when reading from a Delta table with CDC (Change Data Capture) enabled after switching from Databricks Runtime (DBR) 13.3 to 14.3.

Let's break down the situation and explore potential solutions:

  1. Attribute Mismatch Error: The error message you provided indicates that there's an issue with attribute resolution. Specifically, it states that the resolved attribute "GJAHR" is missing from other attributes like "RCLNT", "RLDNR", and "RBUKRS". The error suggests that the same attribute name appears in multiple places, causing ambiguity.

  2. Code Comparison: You mentioned that the same select works fine on a single-user cluster with DBR 14.3 and also when using the SQL equivalent on a shared cluster with DBR 14.3. However, the issue arises when you explicitly select the "GJAHR" column in your Python code snippet.

  3. Potential Causes and Solutions:

    • Attribute Aliasing: Check if there's any aliasing or renaming of attributes happening elsewhere in your code. For example, if you're using aliases for columns, ensure that there's no conflict with the "GJAHR" attribute.
    • Column Ambiguity: Verify that the "GJAHR" column exists in the Delta table and that there are no other columns with the same name. If there are, consider using fully qualified column names (e.g., "bronze.{table_name}.GJAHR") to avoid ambiguity.
    • Schema Changes: Confirm that the schema of the Delta table hasn't changed between DBR versions. If there were any schema modifications (e.g., column additions or deletions), it could impact attribute resolution.
    • Delta Table Metadata: Ensure that the Delta table metadata (including the transaction log) is consistent and up-to-date. You can run DESCRIBE DETAIL <table_name> to inspect the table's location and other details [1].
  4. Bug or Logic Error?: While it's challenging to definitively say whether this is a bug or a logic error without examining the complete context, I recommend thoroughly reviewing your code, schema, and any recent changes.

  5. DBR 14.3 Update: As you mentioned, DBR 14.3 is out of beta, and theoretically, it should work seamlessly. However, it's essential to rule out any specific issues related to your environment, configuration, or table setup.

[1] Databricks Community Forum: AnalysisException: is not a Delta table

 

Jaris
New Contributor III

Hello Kaniz,

Thank you for your comprehensive answer.

Unfortunately, none of those points apply to my case. The selected column is present exactly once in the source table and there is no other code; this is all I am running to reproduce the issue:

data = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 161)
    .option("endingVersion", 161)
    .table("table_name")
    .select("GJAHR")
)
data.count()

I just switch between 2 computes, one Single User and one Shared, both running on the same DBR 14.3, and I get the error only with the Shared cluster.

Thank you.

Kaniz_Fatma
Community Manager

Hi @Jaris, thank you for providing additional details. I apologize for the inconvenience you're experiencing.

Let's explore further to identify the root cause of the issue.

Given that the attribute "GJAHR" is present exactly once in the source table and there is no additional code, it's puzzling that you encounter the error only on the Shared cluster.

 

Here are a few more steps to investigate:

 

Cluster Configuration:

  • Confirm that both the Single User and Shared clusters are configured identically (including DBR version, libraries, and environment variables).
  • Check if there are any differences in the runtime environment that might impact attribute resolution.

Attribute Resolution Order:

  • Spark resolves a column name against the columns produced by the underlying relation and then tracks it by an internal expression ID (the number after # in the error message, e.g. GJAHR#72222).
  • Ensure that the attribute "GJAHR" is not being shadowed by any other attributes with the same name (even if they are not explicitly selected).
  • Verify that there are no conflicting column names in the table schema.

Column Aliasing:

  • To avoid any ambiguity, consider aliasing the selected column during the projection.
  • Modify your code snippet as follows:

data = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 161)
    .option("endingVersion", 161)
    .table("table_name")
    .selectExpr("GJAHR AS my_GJAHR")  # Alias the attribute
)
data.count()

Schema Inspection:

  • Run the following command to inspect the schema of the Delta table:

spark.sql("DESCRIBE TABLE table_name").show(truncate=False)

  • Verify that the attribute "GJAHR" is listed correctly and has the expected data type.
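
The duplicate-name check can also be scripted. Below is a small sketch that counts how many columns match a given name, ignoring case by default because Spark resolves column names case-insensitively unless spark.sql.caseSensitive is enabled. The function name `matching_columns` is hypothetical, and the column list is a stand-in built from the names visible in the error message plus the change data feed metadata columns, since the full schema is elided in the thread.

```python
def matching_columns(columns, name, case_sensitive=False):
    """Return every column whose name matches `name`."""
    if case_sensitive:
        return [c for c in columns if c == name]
    return [c for c in columns if c.lower() == name.lower()]


# In a notebook this list would come from: spark.read.table("table_name").columns
# Stand-in values only; the real table has more columns than shown here.
columns = [
    "RCLNT", "RLDNR", "RBUKRS", "GJAHR",
    "_change_type", "_commit_version", "_commit_timestamp",
]

matches = matching_columns(columns, "gjahr")
print(matches)  # ['GJAHR']
```

A result with more than one entry would point to a genuine name clash. A single entry, as here, matches what the original poster reports.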

Cluster-Specific Behavior:

  • Sometimes, certain behaviors can be specific to the cluster environment.
  • Check if there are any cluster-specific configurations or settings that might affect attribute resolution.

Jaris
New Contributor III

Hello Kaniz,

Thanks again for your effort.

I had already tried everything except the column alias in this form. I have now tried that as well, but it didn't help either.

Cluster settings are also not the issue. Just to be sure, I created a new cluster, left everything on default and only changed the DBR to 14.3. In Single User mode the code runs seamlessly. When I change only the access mode to Shared and restart, the issue appears.

If you have access to a Databricks instance, the issue should be pretty easy to replicate.

I am pretty sure at this point, this is a bug.

Kaniz_Fatma
Community Manager

Hi @Jaris, given the specific scenario you've described, it does indeed sound like an unexpected behaviour or bug within Databricks Runtime 14.3 on shared clusters.

I appreciate your diligence in troubleshooting, and I hope you find a resolution soon. If there's anything else I can assist with, feel free to ask! 🌟

Jaris
New Contributor III

Hello Kaniz,

Is it possible to report this bug? In my case there are several workarounds, which I have mentioned above, but it would be helpful to have this fixed in the future.

Thank you.
