
CDC Delta table select using startingVersion on Shared cluster running DBR 14.3 does not work

Jaris
New Contributor III

Hello everyone,

We have switched from DBR 13.3 to 14.3 on our Shared development cluster, and I can no longer run the following read from a Delta table with CDC enabled:

data = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", x)
    .option("endingVersion", x)
    .table(f"bronze.{table_name}")
    .select("GJAHR")
)

The same select works fine on a Single User cluster with DBR 14.3 and on a Shared cluster with DBR 13.3. It also works on a Shared cluster with DBR 14.3 when I use the following SQL equivalent:

SELECT GJAHR
    FROM table_changes('bronze.{table_name}', x, x)
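
For anyone who needs to stay in Python on the Shared cluster, the SQL route can be wrapped as below. This is a minimal sketch, not a confirmed fix: the helper names `build_table_changes_query` and `read_changes_via_sql` and the table name in the usage line are hypothetical, and `spark` is assumed to be the notebook's SparkSession. Column names are quoted with backticks because, by default, Spark SQL treats double-quoted text as a string literal rather than a column name.

```python
def build_table_changes_query(table_name, start_version, end_version, columns):
    """Build a table_changes() query that selects the given columns."""
    # Backticks quote identifiers in Spark SQL.
    column_list = ", ".join(f"`{c}`" for c in columns)
    return (
        f"SELECT {column_list} "
        f"FROM table_changes('{table_name}', {int(start_version)}, {int(end_version)})"
    )


def read_changes_via_sql(spark, table_name, start_version, end_version, columns):
    """Run the query through spark.sql instead of the DataFrame reader."""
    query = build_table_changes_query(table_name, start_version, end_version, columns)
    return spark.sql(query)


# Hypothetical usage in a notebook where `spark` already exists:
# data = read_changes_via_sql(spark, "bronze.my_table", 161, 161, ["GJAHR"])
print(build_table_changes_query("bronze.my_table", 161, 161, ["GJAHR"]))
```

The versions are passed through `int()` so that only numeric values end up in the query string; the table name is interpolated as-is and should come from a trusted source.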

The issue seems to be that Spark cannot match the selected field to the columns available in the table. If I run the code without the .select("GJAHR"), it works fine. If I select only CDC fields such as _commit_version, it also runs well. Here is an excerpt from the error message produced by the first code snippet:

AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] Resolved attribute(s) "GJAHR" missing from "RCLNT", "RLDNR", "RBUKRS", "GJAHR", ...
!Project [GJAHR#72222]. Attribute(s) with the same name appear in the operation: "GJAHR".
Please check if the right attribute(s) are used. SQLSTATE: XX000;
Aggregate [count(1) AS count(1)#72723L]
+- !Project [GJAHR#72222]
   +- ...
      +- Relation snpdwh.bronze_sap.acdoca[RCLNT#72724,RLDNR#72725,RBUKRS#72726,GJAHR#72727,...
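
One way to read the excerpt above: the Project node asks for GJAHR#72222, while the relation exposes GJAHR#72727, so the column name matches but the expression ID does not. The sketch below only pulls those IDs out of the plan text to make the mismatch visible. The helper name `attribute_ids` is hypothetical, and this is an interpretation of the message, not a confirmed diagnosis of the cause.

```python
import re


def attribute_ids(plan_text, name):
    """Return the sorted expression IDs that follow `name#` in a plan string."""
    pattern = rf"\b{re.escape(name)}#(\d+)"
    return sorted({int(match) for match in re.findall(pattern, plan_text)})


# Two lines copied from the error excerpt (the trailing "..." is kept as posted).
plan = """!Project [GJAHR#72222]
   +- Relation snpdwh.bronze_sap.acdoca[RCLNT#72724,RLDNR#72725,RBUKRS#72726,GJAHR#72727,..."""

# More than one ID for the same name means the plan refers to two different attributes.
print(attribute_ids(plan, "GJAHR"))  # [72222, 72727]
```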

DBR 14.3 is no longer in beta, so I would expect this to work. Apart from the access mode, the type of compute makes no difference. Our Databricks workspace is hosted on Azure.

Is this a bug or do you see any errors in my logic?

Thanks.

3 REPLIES

Jaris
New Contributor III

Hello Kaniz,

Thank you for your comprehensive answer.

Unfortunately, none of those points apply to my case. The selected column is present exactly once in the source table, and there is no other code. This is all I run to reproduce the issue:

data = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 161)
    .option("endingVersion", 161)
    .table("table_name")
    .select("GJAHR")
)
data.count()

I just switch between two computes, one Single User and one Shared, both running DBR 14.3, and I get the error only on the Shared cluster.

Thank you.

Jaris
New Contributor III

Hello Kaniz,

Thanks again for your effort.

I had already tried everything you suggested except the column alias in this form. I have now tried that as well, and it didn't help either.

Cluster settings are not the issue either. Just to be sure, I created a new cluster, left everything at the defaults, and only changed the DBR to 14.3. In Single User mode the code runs without problems. When I change only the access mode to Shared and restart, the issue appears.

If you have access to a Databricks instance, the issue should be pretty easy to replicate.

At this point, I am pretty sure this is a bug.

Jaris
New Contributor III

Hello Kaniz,

Is it possible to report this bug? In my case I can work around it in the several ways I've mentioned above, but it would be helpful to have it fixed in the future.

Thank you.
