Thanks for your help, Alex. I ended up rewriting my code with a Spark UDF -- maybe there is a better solution using only the DataFrame API, but I couldn't find it.
To summarize my problem: I was trying to un-nest a large JSON blob (the fake data in my first attached notebook emulates the problem). I have a data source that contains a lot of nested JSON, and usually I can use pyspark.sql.functions.get_json_object() to extract the value for the key I want. That doesn't work in this case, though, because the nested keys are randomized, so I needed a way to take something like this:
"file_key": {
"nested_blob": {
"rand_nested_id_1": {"k1": "v1", "k2": "v2"},
"rand_nested_id_2": {"k1": "v1", "k2": "v2"},
"rand_nested_id_3": {"k1": "v1", "k2": "v2"}
}
}
And turn it into a DataFrame that looks like this:
file_key | nested_key       | nested_value
file_key | rand_nested_id_1 | {"k1": "v1", "k2": "v2"}
file_key | rand_nested_id_2 | {"k1": "v1", "k2": "v2"}
file_key | rand_nested_id_3 | {"k1": "v1", "k2": "v2"}
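(To make the get_json_object limitation concrete, here's a tiny self-contained sketch -- the column name raw_json and the fixed_id key are just placeholders I made up:)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('{"nested_blob": {"fixed_id": {"k1": "v1"}}}',)], ["raw_json"]
)
# This works when the path is fixed, but with randomized keys there is
# no stable path to put in the second argument, so it breaks down.
df.select(F.get_json_object("raw_json", "$.nested_blob.fixed_id.k1")).show()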
A combination of a Spark UDF that returns the nested key/value pairs as an array, followed by the explode function, did what I wanted. Solution attached.
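In case it's useful to anyone else, here's a minimal sketch of that approach. Names like raw_json, unnest, and the struct field names are assumptions for illustration, not the exact code from my notebook:

import json
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Fake data emulating the nested blob with randomized keys.
df = spark.createDataFrame(
    [("file_key", json.dumps({
        "nested_blob": {
            "rand_nested_id_1": {"k1": "v1", "k2": "v2"},
            "rand_nested_id_2": {"k1": "v1", "k2": "v2"},
            "rand_nested_id_3": {"k1": "v1", "k2": "v2"},
        }
    }))],
    ["file_key", "raw_json"],
)

# The UDF parses the blob and returns the (nested_key, nested_value)
# pairs as an array of structs, since the keys can't be hard-coded.
pair_schema = T.ArrayType(
    T.StructType([
        T.StructField("nested_key", T.StringType()),
        T.StructField("nested_value", T.StringType()),
    ])
)

@F.udf(returnType=pair_schema)
def unnest(raw):
    blob = json.loads(raw).get("nested_blob", {})
    return [(k, json.dumps(v)) for k, v in blob.items()]

# explode() then turns the array into one row per nested key.
result = (
    df.withColumn("pair", F.explode(unnest("raw_json")))
      .select("file_key", "pair.nested_key", "pair.nested_value")
)
result.show(truncate=False)

Running that produces one row per randomized key, matching the table above.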