Thursday
I've been working on ingesting CSV files with varying schemas using Auto Loader. The goal is to take the CSVs and ingest them into a bronze table that writes each record as a key-value mapping containing only the fields relevant to that record, while preserving the data types that Auto Loader infers for those fields.
At first I was ingesting to a wide table, then converting each row to a list of structs, filtering out nulls, converting to a MAP<STRING, STRING>, and then using to_variant_object to convert to variant. However, the need for type homogeneity on the map values defeated the purpose of saving each record as variant.
I was recently informed that the option outlined here: https://learn.microsoft.com/en-us/azure/databricks/ingestion/variant#-ingest-data-from-cloud-object-... could be applied to CSVs. I tried it out, and while it did save each record as a variant-type key-value mapping, it seems to have taken the set of all fields across all ingested CSVs and populated a key for every single field for every single record, leaving the vast majority of values NULL and making processing extremely slow. I don't get why it functions this way... the point of variant is to be able to store semi-structured data, right? Here is the code I used, any ideas?
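For reference, the earlier wide-table approach was along these lines. This is just a sketch with placeholder names (wide_df, record), not the exact code, and it assumes a runtime where pyspark.sql.functions.to_variant_object is available:

```python
from pyspark.sql import functions as F

# Sketch only: wide_df stands in for the wide bronze DataFrame produced by Auto Loader.
cols = wide_df.columns

# Build an array of (key, value) structs, one per column, dropping entries whose
# value is null. Casting every value to STRING is what forces the type homogeneity
# required by MAP<STRING, STRING>, which is where the inferred types get lost.
kv_entries = F.filter(
    F.array(*[
        F.struct(F.lit(c).alias("key"), F.col(c).cast("string").alias("value"))
        for c in cols
    ]),
    lambda e: e["value"].isNotNull(),
)

# Collapse the surviving entries into a map, then wrap the map as a variant object.
record_df = wide_df.select(
    F.to_variant_object(F.map_from_entries(kv_entries)).alias("record")
)
```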
Thursday
What if instead of going the route of using variant you could…
Do you feel you are getting any benefit out of using variant here?
Friday
It's a good question.
The benefit of variant here is that, while we will enforce schemas in silver, there will be many use cases where users look to query columns that we do not have in silver. The reason is that new columns are added by users somewhat on the fly: they create a configuration file that collects measurements from a variety of available sensors, and those sensor data are what come through Auto Loader. If they add a new reading to the config that we haven't explicitly unpacked into a silver table, they will need to query bronze directly, in which case variant is much more user-friendly, especially since these users will not be experienced in Databricks. Conversely, with an ever-expanding wide-table format, the vast majority of columns will be NULL for a given row, which makes the user experience more confusing.
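For example, a user could pull a reading that isn't unpacked in silver yet straight out of the bronze variant column. Table, column, and field names here are placeholders:

```python
# Placeholder names: bronze.sensor_readings is the bronze table, "record" is the
# variant column, and new_sensor_reading is a field that only exists for some rows.
df = spark.sql("""
    SELECT record:new_sensor_reading::DOUBLE AS new_sensor_reading
    FROM bronze.sensor_readings
    WHERE record:new_sensor_reading IS NOT NULL
""")
```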
Friday - last edited Friday
If anyone has any documentation specific to this option with CSV, that would be awesome.
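For reference, the read in question looks roughly like the following, assuming the singleVariantColumn option from the linked article carries over to CSV sources; the path, schema location, and column name are placeholders, not the actual pipeline code:

```python
# Sketch only: paths and the variant column name are placeholders.
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/landing/_schemas/group_a")
    .option("singleVariantColumn", "record")   # each row lands as one VARIANT value
    .load("/Volumes/landing/sensors/group_a/")
)
```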
Friday
If I understand the problem correctly, you are getting extra keys for records from files where those keys don't actually exist. I was not able to reproduce this issue: I am getting different key-value pairs per record and no extra keys with NULL.
Can you share a similar screenshot of your rows? I am also using serverless compute; I hope you have the same.
9 hours ago
Thanks for this, it helped me find the real issue. The data WAS bringing in the correct keys. I was building three streaming tables (groups a, b, and c), so I was using a loop as described here: https://spark.apache.org/docs/latest/declarative-pipelines-programming-guide.html#creating-tables-in.... However, I forgot a critical piece: passing the group into the query function definition. Since the for loop just creates the table definitions, by the time the data was actually streamed, group was pointing to "c" for all three refreshes, so all three tables were streaming in group c data. When I spot-checked a CSV file against the table data, I saw the schema mismatch and assumed it was a checkpointing/schema-evolution issue, but it was not.
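For anyone who hits the same thing: the fix is to pass the group into the function that defines each table, so its value is captured at definition time instead of the query closing over the loop variable. A sketch using the Databricks dlt decorator API (adjust to whichever decorator API your pipeline uses); names and paths are placeholders:

```python
import dlt

groups = ["a", "b", "c"]

def define_group_table(group):
    # `group` is bound here, at definition time, because it is a parameter of this
    # function rather than a free variable captured from the loop below.
    @dlt.table(name=f"bronze_group_{group}")
    def bronze_group():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("singleVariantColumn", "record")
            .load(f"/Volumes/landing/sensors/group_{group}/")
        )

for g in groups:
    define_group_table(g)

# The bug: defining the query functions directly inside `for group in groups:` and
# referencing `group` from inside them. By the time the tables refresh, the loop has
# finished and `group` is "c", so all three tables stream group c's files.
```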