Summary:
We have been seeing weird behavior with structs that we have been trying (unsuccessfully) to track down. We have a struct column in a silver table that should only have data for about 1 in every 500 records; it is normally null. But for about 1 in every 50 records, instead of a null we get a struct with all null properties: {"id":null, "choice":null, "flag":null}.
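To be clear about the distinction, here is a minimal sketch (column and field names are just illustrative, not our real schema). A truly null struct and a struct whose properties are all null are different values, and the struct column's own isNull() tells them apart:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, BooleanType

spark = SparkSession.builder.getOrCreate()

# Illustrative struct type matching the shape of the problem column.
payload_type = StructType([
    StructField("id", StringType()),
    StructField("choice", StringType()),
    StructField("flag", BooleanType()),
])
schema = StructType([StructField("payload", payload_type)])

df = spark.createDataFrame(
    [
        (None,),                # what we expect most of the time: a null struct
        ((None, None, None),),  # what we actually see for ~1 in 50 records
    ],
    schema,
)

df.select(
    "payload",
    F.col("payload").isNull().alias("struct_is_null"),
).show(truncate=False)
# struct_is_null is true for the first row and false for the second,
# even though every field inside the second row's struct is null.
```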
Question: Has anyone else come across this type of behavior?
Additional Details:
Runtime: 14.3

We are reading a chunk of JSON from a string column in bronze and then parsing it with from_json in our silver notebook. When we checked the input data from bronze, in every case the property was simply not present, so we know this isn't a JSON problem. Our input serializer only adds the property when there is data.

Our records can change, so all versions of the data are kept in bronze and the newest version is merged into silver. We initially thought this was the result of a merge: every row with a null had only one version (the merge statement executed an insert), and every row with a struct of all null properties had multiple versions (the merge statement executed an update).

We set up a small-scale mockup of the merge with various scenarios and we aren't able to reproduce the issue. In every case, the merge properly wrote either a null or a struct with nulls based on what we provided in the input dataframe. It still feels like merge should be the culprit here, because the rest of the microbatch code has no idea whether this is the first version or the fifth version of the row; it just transforms the JSON, and the transforms should be idempotent.
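For reference, this is roughly the shape of the mockup we mean; every name here (table, key, JSON layout) is a placeholder, not our real schema. It parses the bronze JSON with from_json and then MERGEs into a small Delta table, so both the insert path and the update path can be inspected:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, BooleanType

spark = SparkSession.builder.getOrCreate()

payload_type = StructType([
    StructField("id", StringType()),
    StructField("choice", StringType()),
    StructField("flag", BooleanType()),
])
raw_schema = StructType([StructField("payload", payload_type)])

# Simulated bronze rows: the property is simply absent when there is no data.
bronze = spark.createDataFrame(
    [
        ("A", '{"other_field": 1}'),                       # no struct data
        ("B", '{"payload": {"id": "x", "choice": "y"}}'),  # struct data present
    ],
    ["record_id", "raw_json"],
)

updates = bronze.select(
    "record_id",
    F.from_json("raw_json", raw_schema).getField("payload").alias("payload"),
)

# Merge the newest version into a test silver table (assumed to already exist).
silver = DeltaTable.forName(spark, "silver_test")
(
    silver.alias("t")
    .merge(updates.alias("s"), "t.record_id = s.record_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# After the merge, check whether any row carries a struct of all nulls
# rather than a null struct.
spark.table("silver_test").select(
    "record_id",
    F.col("payload").isNull().alias("struct_is_null"),
).show()
```

In every scenario like this that we ran, the merged row came out matching the input dataframe exactly.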
So, we are out of ideas on what else to try to reproduce this. The only other thing we considered is whether something different happens with schema evolution in the dataframes when every row in the microbatch has a null for this column versus when one row has a value and the others do not. This is a very sparse data feed; the majority of the time we are only getting one row in our microbatch.
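If it helps anyone reason about that theory, here is a minimal sketch of the check we have in mind (schema and field names are illustrative): parse one microbatch where the property is absent from every row and one where a single row carries it, and compare what from_json produces. Since from_json is given an explicit schema, the column type should be identical in both cases, so any difference would have to show up in the values themselves:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, BooleanType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("payload", StructType([
        StructField("id", StringType()),
        StructField("choice", StringType()),
        StructField("flag", BooleanType()),
    ]))
])

# Batch where no row has the property vs. batch where one row does.
all_absent = spark.createDataFrame([('{"other": 1}',)], ["raw_json"])
one_present = spark.createDataFrame(
    [('{"other": 1}',), ('{"payload": {"id": "x"}}',)], ["raw_json"]
)

for label, batch in [("all absent", all_absent), ("one present", one_present)]:
    parsed = batch.select(
        F.from_json("raw_json", schema).getField("payload").alias("payload")
    )
    print(label)
    parsed.select(
        "payload",
        F.col("payload").isNull().alias("struct_is_null"),
    ).show(truncate=False)
```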