01-18-2023 12:39 AM
Hi All,
I have a deeply nested spark dataframe struct something similar to below
|-- id: integer (nullable = true)
|-- lower: struct (nullable = true)
| |-- field_a: integer (nullable = true)
| |-- upper: struct (nullable = true)
| | |-- field_A: integer (nullable = true)
| | |-- num: struct (nullable = true)
| | | |-- field_1: integer (nullable = true)
| | | |-- field_2: string (nullable = true)
I'm looking to flatten this so that I end up with a new schema like this:
|-- id: integer (nullable = true)
|-- lower.field_a: integer (nullable = true)
|-- lower.upper.field_A: integer (nullable = true)
|-- lower.upper.num.field_1: integer (nullable = true)
|-- lower.upper.num.field_2: string (nullable = true)
The reason for this change is so I can put this into a nice table where each column is an element of my nested struct. The column names don't matter too much to me.
I know I can use df.select('*', 'lower.*', 'lower.upper.*', 'lower.upper.num.*') to get what I want, but here's the trick:
this struct will change over time, and I'm looking for an elegant way to flatten it without referencing specific columns.
Any ideas? Or tips?
Thanks
Aidonis
01-18-2023 12:52 AM
You need to use something like:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def flatten(schema, prefix=None):
    # Recursively collect the dotted path of every leaf (non-struct) field.
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields

# Alias each column with its full dotted path so that leaves sharing a name
# in different structs (e.g. field_a vs. field_A) don't collide.
df_flat = df.select([col(name).alias(name) for name in flatten(df.schema)])
01-18-2023 12:59 AM
@Aidan Heffernan
Here you can find a piece of code that flattens JSON based on the data type (Array or Struct).
04-17-2024 08:21 PM
You can try this as well:
flatten-spark-dataframe · PyPI
This also lets you flatten to a specific level of nesting.