01-18-2023 12:39 AM
Hi All,
I have a deeply nested Spark DataFrame with a schema similar to the one below:
|-- id: integer (nullable = true)
|-- lower: struct (nullable = true)
| |-- field_a: integer (nullable = true)
| |-- upper: struct (nullable = true)
| | |-- field_A: integer (nullable = true)
| | |-- num: struct (nullable = true)
| | | |-- field_1: integer (nullable = true)
| | | |-- field_2: string (nullable = true)
I'm looking to flatten this so that I have a new schema like this:
|-- id: integer (nullable = true)
|-- lower: struct (nullable = true)
|-- lower.field_a: integer (nullable = true)
|-- lower.upper: struct (nullable = true)
|-- lower.upper.field_A: integer (nullable = true)
|-- lower.upper.num: struct (nullable = true)
|-- lower.upper.num.field_1: integer (nullable = true)
|-- lower.upper.num.field_2: string (nullable = true)
The reason for this change is so I can put this into a nice table where each column is an element of my nested struct. The column names don't matter too much to me.
I know I can use df.select('*', 'lower.*', 'lower.upper.*', 'lower.upper.num.*') to get what I want; however, here's the trick...
This struct will change over time, and I am looking for an elegant way to flatten it without referencing specific columns.
Any ideas? Or tips?
Thanks
Aidonis
Labels:
- Nested
Accepted Solutions
01-18-2023 12:52 AM
You need to use something like:

from pyspark.sql.types import StructType

def flatten(schema, prefix=None):
    # Recursively collect the dotted path of every leaf field in the schema.
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, StructType):
            # Descend into nested structs, carrying the path so far as the prefix.
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields

df.select(flatten(df.schema))
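
If you also want the output columns to keep the full dotted path as their name (matching the schema sketched in the question), a minimal variant, assuming the same flatten helper as above, would be:

from pyspark.sql.functions import col

# Alias each flattened field with its full dotted path, e.g. "lower.upper.num.field_1",
# so the origin of each column stays visible and leaf names can't collide.
flat_cols = [col(name).alias(name) for name in flatten(df.schema)]
df_flat = df.select(flat_cols)
df_flat.printSchema()

Note that column names containing dots need backticks when you reference them afterwards, e.g. df_flat.select('`lower.upper.num.field_1`').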
01-18-2023 12:59 AM
@Aidan Heffernan
Here you can find a piece of code that flattens JSON based on the data type (Array or Struct).
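
The linked code isn't reproduced here, but a rough sketch of that idea, repeatedly expanding StructType columns and exploding ArrayType columns until nothing nested is left (the column naming and use of explode_outer are my assumptions, not the linked implementation), looks like this:

from pyspark.sql.types import ArrayType, StructType
from pyspark.sql.functions import col, explode_outer

def flatten_df(df):
    # Keep expanding until no struct or array columns remain.
    while True:
        complex_fields = {
            f.name: f.dataType
            for f in df.schema.fields
            if isinstance(f.dataType, (StructType, ArrayType))
        }
        if not complex_fields:
            return df
        name, dtype = next(iter(complex_fields.items()))
        if isinstance(dtype, StructType):
            # Promote each struct field to a top-level column, prefixed with the parent name.
            expanded = [col(name + '.' + c.name).alias(name + '_' + c.name) for c in dtype.fields]
            df = df.select('*', *expanded).drop(name)
        else:
            # ArrayType: produce one row per array element, keeping rows with null/empty arrays.
            df = df.withColumn(name, explode_outer(col(name)))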
04-17-2024 08:21 PM
You can try this as well:
flatten-spark-dataframe · PyPI
This also allows flattening to a specific level.

