cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Inconsistant handling of null structs vs strucs with all null values.

thackman
New Contributor III

Summary:

We have a weird behavior with structs that we have been trying (unsuccessfully) to track down.  We have a struct column in a silver table that should only have data for 1 in every 500 records. It's normally null. But for about 1 in every 50 records, instead of getting a null we get a struct with all null properties. {"id":null, "choice":null, "flag":null}.

Question: Has anyone else come across this type of behavior?

Additional Details: 

Runtime:14.3  We are reading a chunk of json from a string column in bronze and then parsing it with from_json in our silver notebook. When we checked the input data from bronze, in every case the property was simply not present. So we know this isn't a json problem.  Our input serializer only adds the property when there is data. Our records can change so all versions of the data are kept in bronze and the newest version of the data is merged into silver. We initially thought that that this was the result of a merge.  Every row with a null only had one version (merge statement executed an insert) and every row with a struct of all null properties had multiple versions (merge statement executed an update).  We setup a small scale mockup of the merge with various scenarios and we aren't able to reproduce the issue. In every case, the merge was properly writing either null or a struct with nulls based on what we provided in the input dataframe. It feels like merge should be the culprit here because the rest of the microbatch code has no idea if this is the first version or fifth version of the row. It just transforms the json and the transforms should be idempotent.

So, we are out of ideas on what else we should try to reproduce this.  The only other thing we considered is if something different happens with the schema evolution in the dataframes if every row in the microbatch has a null for this column vs if one row has a value and the others do not. This is a very sparse data feed. The majority of the time we are only getting one row in our microbatch.

 

 

 

 

1 REPLY 1

cgrant
Databricks Employee
Databricks Employee

Here are some strategies for debugging this:

  • Before you perform each merge, write your source dataframe out as a table, and include the target table's version in the table's name
  • If possible, enable the change data feed on your table so as to see changes for each version
  • The next time you witness this behavior, cross-reference the source table that you've written out with the table's changes.

If you can provide a simple reproduction of the issue, you can reply to this thread or reach out to Databricks support for help

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now