Databricks Community

ivanychev · ‎03-07-2023

After upgrading from DBR 11.3 LTS to DBR 12.2 LTS we started to observe the following error during "read from parquet and write to delta" piece of logic.

AnalysisException: Resolved attribute(s) group_id#72,display_name#73,parent_id#74,path#75,path_list#76 missing from day#178,ac_key#179,group_id#180,display_name#181,parent_id#182,path#183,path_list#184 in operator !Project [empty2null(day#178) AS day#568, empty2null(ac_key#179) AS ac_key#569, group_id#72, display_name#73, parent_id#74, path#75, path_list#76]. Attribute(s) with the same name appear in the operation: group_id,display_name,parent_id,path,path_list. Please check if the right attribute(s) are used.;
WriteIntoDeltaCommand OutputSpec(s3://constructor-analytics-data/tables/delta_prod/item_groups,Map(),ArrayBuffer(day#178, ac_key#179, group_id#180, display_name#181, parent_id#182, path#183, path_list#184))
+- DeltaOptimizedWriter [day, ac_key], com.databricks.sql.transaction.tahoe.DeltaLog@3dc2a8b5, [spark.databricks.delta.optimize.minFileSize=268435456, spark.databricks.delta.autoCompact.maxFileSize=134217728, spark.databricks.delta.optimize.maxFileSize=268435456, spark.databricks.delta.autoCompact.minFileSize=67108864]
   +- DeltaInvariantChecker [Check(EXPRESSION(('day = 2023-03-07)),('day = 2023-03-07)), Check(EXPRESSION(('ac_key = key_ZMdl8uk3o2FQ3Bc9)),('ac_key = key_ZMdl8uk3o2FQ3Bc9))]
      +- !Project [empty2null(day#178) AS day#568, empty2null(ac_key#179) AS ac_key#569, group_id#72, display_name#73, parent_id#74, path#75, path_list#76]
         +- Project [day#164 AS day#178, ac_key#165 AS ac_key#179, group_id#166 AS group_id#180, display_name#167 AS display_name#181, parent_id#168 AS parent_id#182, path#169 AS path#183, path_list#170 AS path_list#184]
            +- Project [day#108 AS day#164, ac_key#116 AS ac_key#165, group_id#124 AS group_id#166, display_name#132 AS display_name#167, parent_id#140 AS parent_id#168, path#148 AS path#169, path_list#156 AS path_list#170]
               +- Project [day#108, ac_key#116, group_id#124, display_name#132, parent_id#140, path#148, path_list#76 AS path_list#156]
                  +- Project [day#108, ac_key#116, group_id#124, display_name#132, parent_id#140, path#75 AS path#148, path_list#76]
                     +- Project [day#108, ac_key#116, group_id#124, display_name#132, parent_id#74 AS parent_id#140, path#75, path_list#76]
                        +- Project [day#108, ac_key#116, group_id#124, display_name#73 AS display_name#132, parent_id#74, path#75, path_list#76]
                           +- Project [day#108, ac_key#116, group_id#72 AS group_id#124, display_name#73, parent_id#74, path#75, path_list#76]
                              +- Project [day#108, ac_key#93 AS ac_key#116, group_id#72, display_name#73, parent_id#74, path#75, path_list#76]
                                 +- Project [day#84 AS day#108, ac_key#93, group_id#72, display_name#73, parent_id#74, path#75, path_list#76]
                                    +- Project [day#84, ac_key#93, group_id#72, display_name#73, parent_id#74, path#75, path_list#76]
                                       +- Project [day#84, key_ZMdl8uk3o2FQ3Bc9 AS ac_key#93, group_id#72, display_name#73, parent_id#74, path#75, path_list#76]
                                          +- Project [2023-03-07 AS day#84, ac_key#71, group_id#72, display_name#73, parent_id#74, path#75, path_list#76]
                                             +- Relation [day#70,ac_key#71,group_id#72,display_name#73,parent_id#74,path#75,path_list#76] parquet

Weird thing here is that at !Project there's group_id#72 but the dependent Project has group_id#180 as if there's some bug in the plan. There's not joins in this pipeline, it's as simple as read + write to delta.

Do you have any idea of what can be wrong here? DeltaOptimizedWriter issue perhaps?

Sergey

Anonymous · ‎03-15-2023

@Sergey Ivanychev : I am providing you a bunch of ideas to think about, please investigate further once you read into it and rule out all the ones that you have done.

It seems like there might be a mismatch between the attributes being referenced in the code and the attributes that are actually available in the data being processed. The error message indicates that the attributes group_id#72, display_name#73, parent_id#74, path#75, path_list#76 are missing from the input table day#178, ac_key#179, group_id#180, display_name#181, parent_id#182, path#183, path_list#184. The mismatch in attribute names is also evident in the plan, where the !Project operator refers to group_id#72, display_name#73, parent_id#74, path#75, path_list#76, but the subsequent

Project operator uses group_id#166, display_name#167, parent_id#168, path#169, path_list#170 . This could be a sign of a bug in the optimizer or a discrepancy in the metadata about the schema of the input data.

You might want to investigate whether the schema of the input data has changed in any way during the upgrade process, or whether there are any other factors that could have caused the attribute names to be mismatched.

You could also try to explicitly specify the schema of the input data when reading it in, to ensure that the attributes are being correctly identified.

It also looks like the error message is indicating that there are missing attributes in the output of a project operation, which is causing a problem when writing to Delta. Specifically, the attributes "group_id", "display_name", "parent_id", "path", and "path_list" appear to be missing from the output of the project operation.

It's possible that this issue is related to the DeltaOptimizedWriter, but it's also possible that the problem is elsewhere in the pipeline. It would be helpful to review the code that is generating this pipeline to see if there are any obvious issues, such as missing or incorrect mappings.

tom_jochums · ‎03-20-2023

I've been encountering a similar issue when upgrading to 12.2, and started stripping back all the changes, to the point where I'm basically running the same exact code in 12.2 and 11.3 and it works in 11.3 but gives a similar error to the original question.

I stripped it back even further to just creating a test that only deals with trying to do a DataTable merge and am able to simulate it. Not sure it would match the situation above perfectly, but it might help with the reproduction of it.

Basically, I've got two different clusters, one running each version that I can switch between, and I've got a scala notebook (attached), it contains a Sql script at the top that generates the Delta Table to write to (that simulates the schema we're using), and then code to create a simple data frame to merge into it.

As I say it seems to run fine in 11.3, but does not in 12.2. Any thoughts on if there is an issue with the merge approach, or ways around it? We use it on a significant portion of our tasks and it prevents us from updating to 12.2, which we're hoping to do for some other things we would gain from it?

Thanks, Tom

Valtor · ‎03-24-2023

Per my testing, a string column used in a partition by clause (which happens to be followed by an integer column at creation) seems to be causing this issue.

So if you try putting it in second position, this seems to work.

CREATE TABLE IF NOT EXISTS TestEmit.Kafka_Emit_Log_Test

(

EnterpriseId INT NOT NULL,

Topic String NOT NULL,

Key String NOT NULL,

Value String NOT NULL

)

USING DELTA

PARTITIONED BY (Topic,EnterpriseId)

LOCATION '/mnt/test/emit/Kafka_Emit_Log_Test';

Valtor · ‎03-24-2023

We are having the same issues. This is 100% definitely a bug in Databricks 12.2. Just changing an existing cluster from 11.3 to 12.2 (without touching anything else) results in those errors: "Resolved attribute(s) ... missing from ..." and "Attribute(s) with the same name appear in the operation".

ivanychev · ‎03-26-2023

Hey Suteja. Are there any plans in Databricks to fix this? This prevents us to update to the next LTS

Sergey

tom_jochums · ‎04-25-2023

For what it's worth, we're using Databricks hosted in azure, and working through support, this problem did get resolved in the latest 12.2 version in that environment for us, not sure if it applies to all hosting environments, but it may be fixed for others who were experiencing the issue on upgrade