Databricks Community

Jaris · ‎02-16-2024

Hello everyone,

We have switched from DBR 13.3 to 14.3 on our Shared development cluster, and I am no longer able to run the following read from a Delta table with CDC enabled:

data = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", x)
    .option("endingVersion", x)
    .table(f"bronze.{table_name}")
    .select("GJAHR")
)

The same select works fine on a Single User cluster with DBR 14.3, on a Shared cluster with DBR 13.3, and also when I use the following SQL equivalent on a Shared cluster with DBR 14.3:

SELECT "GJAHR"
    FROM table_changes('bronze.{table_name}', x, x)

The issue seems to be that it somehow cannot match the selected field to what is available in the table. If I run the code without the .select("GJAHR"), it works fine. Also, if I select only the CDC fields such as _commit_version, everything runs well. Here is an excerpt from the error message produced by the first code snippet:

AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] Resolved attribute(s) "GJAHR" missing from "RCLNT", "RLDNR", "RBUKRS", "GJAHR", ...

!Project [GJAHR#72222]. Attribute(s) with the same name appear in the operation: "GJAHR".
Please check if the right attribute(s) are used. SQLSTATE: XX000;
Aggregate [count(1) AS count(1)#72723L]
+- !Project [GJAHR#72222]
   +- ...
      +- Relation snpdwh.bronze_sap.acdoca[RCLNT#72724,RLDNR#72725,RBUKRS#72726,GJAHR#72727,...

DBR 14.3 is no longer in beta, so everything should work fine. The type of compute (apart from the access mode mentioned above) plays no role. Databricks is hosted on Azure.

Is this a bug or do you see any errors in my logic?

Thanks.
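
One way to read the plan excerpt in the post above: the number after # in names like GJAHR#72222 is Spark's internal expression ID. The Project node asks for GJAHR#72222, while the Relation outputs GJAHR#72727, so the name matches but the ID does not. The following is a minimal sketch that pulls such mismatches out of a pasted plan; the helper name conflicting_attributes is hypothetical, and the snippet only inspects text, it does not explain why the IDs diverge on Shared access mode.

```python
import re
from collections import defaultdict

# Matches attribute references such as GJAHR#72222 in a Spark plan string.
ATTR_RE = re.compile(r"(\w+)#(\d+)")


def conflicting_attributes(plan_text):
    """Return {name: sorted expression IDs} for names seen with more than one ID."""
    ids = defaultdict(set)
    for name, expr_id in ATTR_RE.findall(plan_text):
        ids[name].add(int(expr_id))
    return {name: sorted(found) for name, found in ids.items() if len(found) > 1}


# Plan lines copied from the error message in the post above.
plan_excerpt = """
Aggregate [count(1) AS count(1)#72723L]
+- !Project [GJAHR#72222]
   +- ...
      +- Relation snpdwh.bronze_sap.acdoca[RCLNT#72724,RLDNR#72725,RBUKRS#72726,GJAHR#72727,...
"""

print(conflicting_attributes(plan_excerpt))  # {'GJAHR': [72222, 72727]}
```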

Kaniz_Fatma · ‎02-16-2024

Hi @Jaris, It appears that you’ve encountered an issue when reading from a Delta table with CDC (Change Data Capture) enabled after switching from Databricks Runtime (DBR) 13.3 to 14.3.

Let’s break down the situation and explore potential solutions:

Attribute Mismatch Error: The error message you provided indicates that there’s an issue with attribute resolution. Specifically, it states that the resolved attribute “GJAHR” is missing from other attributes like “RCLNT,” “RLDNR,” and “RBUKRS.” The error suggests that the same attribute name appears in multiple places, causing ambiguity.
Code Comparison: You mentioned that the same select works fine on a single-user cluster with DBR 14.3 and also when using SQL equivalent on a shared cluster with DBR 14.3. However, the issue arises when you explicitly select the “GJAHR” column in your Python code snippet.
Potential Causes and Solutions:
- Attribute Aliasing: Check if there’s any aliasing or renaming of attributes happening elsewhere in your code. For example, if you’re using aliases for columns, ensure that there’s no conflict with the “GJAHR” attribute.
- Column Ambiguity: Verify that the “GJAHR” column exists in the Delta table and that there are no other columns with the same name. If there are, consider using fully qualified column names (e.g., “bronze.{table_name}.GJAHR”) to avoid ambiguity.
- Schema Changes: Confirm that the schema of the Delta table hasn’t changed between DBR versions. If there were any schema modifications (e.g., column additions or deletions), it could impact attribute resolution.
- Delta Table Metadata: Ensure that the Delta table metadata (including the transaction log) is consistent and up-to-date. You can run DESCRIBE DETAIL <table_name> to inspect the table’s location and other details ¹.
Bug or Logic Error?: While it’s challenging to definitively say whether this is a bug or a logic error without examining the complete context, I recommend thoroughly reviewing your code, schema, and any recent changes.
DBR 14.3 Update: As you mentioned, DBR 14.3 is out of beta, and theoretically, it should work seamlessly. However, it’s essential to rule out any specific issues related to your environment, configuration, or table setup.

¹: Databricks Community Forum: AnalysisException: is not a Delta table

Jaris · ‎02-16-2024

Hello Kaniz,

Thank you for your comprehensive answer.

Unfortunately none of those points apply to my case. The selected column is present exactly once in the source table and there is no more code, this is all I am running to reproduce the issue:

data = ( spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 161)
    .option("endingVersion", 161)
    .table("table_name")
    .select("GJAHR")
)
data.count()

I just switch between 2 computes, one Single User and one Shared, both running on the same DBR 14.3, and I get the error only with the Shared cluster.

Thank you.

Kaniz_Fatma · ‎02-16-2024

Hi @Jaris, Thank you for providing additional details. I apologize for the inconvenience you’re experiencing.

Let’s explore further to identify the root cause of the issue.

Given that the attribute “GJAHR” is present exactly once in the source table and there is no additional code, it’s puzzling that you encounter the error only on the Shared cluster.

Here are a few more steps to investigate:

Cluster Configuration:

Confirm that both the Single User and Shared clusters are configured identically (including DBR version, libraries, and environment variables).
Check if there are any differences in the runtime environment that might impact attribute resolution.

Attribute Resolution Order:

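Spark resolves a column name against the output of the underlying plan and then tracks the column by an internal expression ID (the number after # in the error message, for example GJAHR#72222).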
Ensure that the attribute “GJAHR” is not being shadowed by any other attributes with the same name (even if they are not explicitly selected).
Verify that there are no conflicting column names in the table schema.

Column Aliasing:

To avoid any ambiguity, consider aliasing the selected column during the projection.
Modify your code snippet as follows:

data = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 161)
    .option("endingVersion", 161)
    .table("table_name")
    .selectExpr("GJAHR AS my_GJAHR")  # Alias the attribute
)
data.count()

Schema Inspection:

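Run the following command to inspect the schema of the Delta table (DESCRIBE TABLE lists columns and data types, whereas DESCRIBE DETAIL reports table-level details such as location):

spark.sql("DESCRIBE TABLE table_name").show(truncate=False)

Verify that the attribute “GJAHR” is listed correctly and has the expected data type.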

Cluster-Specific Behavior:

Sometimes, certain behaviors can be specific to the cluster environment.
Check if there are any cluster-specific configurations or settings that might affect attribute resolution.
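
The check for conflicting column names suggested above can be sketched as a small helper. This is only an illustration: duplicate_columns is a hypothetical name, the sample column list is made up from the names visible in the error message plus the standard change data feed columns, and the comparison ignores case because Spark treats column names case-insensitively by default.

```python
from collections import Counter


def duplicate_columns(columns, case_sensitive=False):
    """Return the column names that occur more than once in the list."""
    keys = [c if case_sensitive else c.lower() for c in columns]
    counts = Counter(keys)
    return sorted({c for c, k in zip(columns, keys) if counts[k] > 1})


# With a live session the input would be something like:
#   duplicate_columns(spark.table("table_name").columns)
# Illustrative column list (assumption, not the real table schema):
sample = ["RCLNT", "RLDNR", "RBUKRS", "GJAHR",
          "_change_type", "_commit_version", "_commit_timestamp"]

print(duplicate_columns(sample))              # []
print(duplicate_columns(sample + ["gjahr"]))  # ['GJAHR', 'gjahr']
```

An empty result for the real table would be consistent with the report that the column is present exactly once.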

Jaris · ‎02-16-2024

Hello Kaniz,

Thanks again for your effort.

I had already tried everything except the column alias in this form, and now that I have tried it, it didn't help either.

Cluster settings are also not the issue. Just to be sure, I created a new cluster, left everything on default and only changed the DBR to 14.3. In Single User mode the code runs seamlessly. When I change only the access mode to Shared and restart, the issue appears.

If you have access to a Databricks instance, the issue should be pretty easy to replicate.

I am pretty sure at this point, this is a bug.

Kaniz_Fatma · ‎02-18-2024

Hi @Jaris, Given the specific scenario you’ve described, it does indeed sound like an unexpected behaviour or bug within Databricks Runtime 14.3 on shared clusters.

I appreciate your diligence in troubleshooting, and I hope you find a resolution soon. If there’s anything else I can assist with, feel free to ask! 🌟

Jaris · ‎02-19-2024

Hello Kaniz,

Is it possible to report this bug? In my case there are multiple ways to work around it, as I mentioned above, but it would be helpful to have it fixed in the future.

Thank you.
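
For readers looking for one of the workarounds mentioned in this thread, the SQL route through table_changes can also be called from Python. The sketch below is an assumption-laden illustration, not a confirmed fix: table_changes_sql and read_changes are hypothetical helper names, "bronze.table_name" is a placeholder, and column names are wrapped in backticks because Spark SQL normally parses double-quoted text as a string literal rather than a column reference. The query is built with plain string formatting, so it should only be used with trusted table and column names.

```python
def table_changes_sql(table, start_version, end_version, columns=("*",)):
    """Build a table_changes query for the given version range."""
    cols = ", ".join(c if c == "*" else f"`{c}`" for c in columns)
    return (
        f"SELECT {cols} "
        f"FROM table_changes('{table}', {int(start_version)}, {int(end_version)})"
    )


def read_changes(spark, table, start_version, end_version, columns=("*",)):
    """Run the query on an existing SparkSession and return a DataFrame."""
    return spark.sql(table_changes_sql(table, start_version, end_version, columns))


print(table_changes_sql("bronze.table_name", 161, 161, ["GJAHR"]))
# SELECT `GJAHR` FROM table_changes('bronze.table_name', 161, 161)
```

In a notebook, read_changes(spark, "bronze.table_name", 161, 161, ["GJAHR"]).count() would stand in for the failing DataFrame read.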

CDC Delta table select using startingVersion on Shared cluster running DBR 14.3 does not work