Databricks Community

thackman · ‎02-26-2025

Summary:

We are seeing odd behavior with structs that we have been trying (unsuccessfully) to track down. We have a struct column in a silver table that should only have data for about 1 in every 500 records; it is normally null. But for about 1 in every 50 records, instead of a null we get a struct whose properties are all null: {"id":null, "choice":null, "flag":null}.

Question: Has anyone else come across this type of behavior?

Additional Details:

Runtime: 14.3

We read a chunk of JSON from a string column in bronze and parse it with from_json in our silver notebook. When we checked the input data from bronze, the property was simply not present in every case, so we know this isn't a JSON problem. Our input serializer only adds the property when there is data.

Our records can change, so all versions of the data are kept in bronze and the newest version is merged into silver. We initially thought that this was the result of a merge: every row with a null had only one version (the merge statement executed an insert), and every row with a struct of all-null properties had multiple versions (the merge statement executed an update). We set up a small-scale mockup of the merge with various scenarios, but we weren't able to reproduce the issue. In every case, the merge correctly wrote either a null or a struct with nulls, based on what we provided in the input dataframe. It still feels like the merge should be the culprit, because the rest of the microbatch code has no idea whether this is the first or the fifth version of the row. It just transforms the JSON, and the transforms should be idempotent.

So, we are out of ideas on what else to try to reproduce this. The only other thing we considered is whether schema evolution in the dataframes behaves differently when every row in the microbatch has a null for this column versus when one row has a value and the others do not. This is a very sparse data feed; most of the time we only get one row in our microbatch.
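One way to measure how often each case occurs is to filter silver for rows where the struct itself is non-null but every field inside it is null. This is a minimal sketch, assuming the struct column is called `details` (a hypothetical name; the post does not give it) with the fields `id`, `choice`, and `flag` from the example above. The helper only builds a Spark SQL predicate string, which `DataFrame.filter` accepts.

```python
def all_null_struct_predicate(column, fields):
    """Build a Spark SQL predicate matching rows where `column` is a
    non-null struct whose listed fields are all null."""
    quoted = f"`{column}`"
    parts = [f"{quoted} IS NOT NULL"]
    parts += [f"{quoted}.`{field}` IS NULL" for field in fields]
    return " AND ".join(parts)


# `details` is a hypothetical column name; the field names come from the post.
predicate = all_null_struct_predicate("details", ["id", "choice", "flag"])
print(predicate)

# Usage in a notebook (assumes an existing DataFrame named `silver_df`):
#   silver_df.filter(predicate).count()             # structs with all-null fields
#   silver_df.filter("`details` IS NULL").count()   # true nulls
```

Running the same filter against the dataframe right after from_json, and again against the merge source, should help narrow down which step first produces the all-null struct.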

cgrant · ‎02-27-2025

Here are some strategies for debugging this:

Before you perform each merge, write your source dataframe out as a table, and include the target table's version in the table's name
If possible, enable the change data feed on your table so as to see changes for each version
The next time you witness this behavior, cross-reference the source table that you've written out with the table's changes.
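The steps above could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: the helper names, the `debug_schema` argument, and the snapshot naming scheme are all hypothetical, and `spark` and `source_df` are assumed to be the notebook's session and the merge source dataframe. The change data feed also has to be enabled on the target first, for example with `ALTER TABLE <target> SET TBLPROPERTIES (delta.enableChangeDataFeed = true)`.

```python
def snapshot_table_name(target_table, target_version):
    """Name for the saved merge source, tagged with the target's version.
    The naming scheme is an assumption chosen for this sketch."""
    base = target_table.split(".")[-1]
    return f"{base}_merge_source_v{target_version}"


def latest_version(spark, target_table):
    """Latest Delta version of the target, read from its history
    (DESCRIBE HISTORY returns the most recent commit first)."""
    row = spark.sql(f"DESCRIBE HISTORY {target_table} LIMIT 1").first()
    return row["version"]


def snapshot_source(spark, source_df, target_table, debug_schema):
    """Save the merge source as a table before the merge runs.
    The version in the name is the target's version *before* the merge."""
    version = latest_version(spark, target_table)
    name = f"{debug_schema}.{snapshot_table_name(target_table, version)}"
    source_df.write.mode("overwrite").saveAsTable(name)
    return name


def read_changes(spark, target_table, starting_version):
    """Read the target's change data feed from `starting_version` onward."""
    return (
        spark.read.option("readChangeFeed", "true")
        .option("startingVersion", starting_version)
        .table(target_table)
    )
```

When the all-null struct shows up again, the `_commit_version` column in the change feed identifies which merge wrote it, and the snapshot saved just before that commit shows whether the struct was already non-null in the source dataframe.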

If you can provide a simple reproduction of the issue, you can reply to this thread or reach out to Databricks support for help.


Inconsistent handling of null structs vs structs with all null values.
