topic Flatten Deep Nested Struct in Data Engineering

Flatten Deep Nested Struct

Aidonis — Wed, 18 Jan 2023 08:39:30 GMT

Hi All,

I have a deeply nested spark dataframe struct something similar to below

|-- id: integer (nullable = true)

|-- lower: struct (nullable = true)

| |-- field_a: integer (nullable = true)

| |-- upper: struct (containsNull = true)

| | |-- field_A: integer (nullable = true)

| | |-- num: struct (containsNull = true)

| | | |-- field_1: integer (nullable = true)

| | | |-- field_2: string (nullable = true)

Im looking to flatten this such that I have a news struct like this

|-- id: integer (nullable = true)

|-- lower: struct (nullable = true)

|-- lower.field_a: integer (nullable = true)

|-- lower.upper: struct (containsNull = true)

|-- lower.upper.field_A: integer (nullable = true)

|-- lower.upper.num: struct (containsNull = true)

|-- lower.upper.num.field_1: integer (nullable = true)

|-- lower.upper.num.field_2: string (nullable = true)

The reason for this change is so I can put this into a nice table where each column is an element in my nested struct. The column names dont matter to much to me.

I know I can use df.select('*', 'lower.*', 'lower.upper.*' , 'lower.upper.num.*') to get what I want however heres the trick....

This Struct will change over time and I am looking for an elegant way to do flatten the struct without referencing specific columns.

Any ideas? Or tips?

Thanks

Aidonis

Re: Flatten Deep Nested Struct

Hubert-Dudek — Wed, 18 Jan 2023 08:52:16 GMT

You need to use something like:

from pyspark.sql.types import StructType
 
def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
 
    return fields
 
df.select(flatten(df.schema))

Re: Flatten Deep Nested Struct

daniel_sahal — Wed, 18 Jan 2023 08:59:18 GMT

@Aidan Heffernan

https://medium.com/@thomaspt748/how-to-flatten-json-files-dynamically-using-apache-pyspark-c6b1b5fd4777

Here you can find a piece of code that is flattening json based on the datatype (Array or Struct)

Re: Flatten Deep Nested Struct

Praveen-bpk21 — Thu, 18 Apr 2024 03:21:17 GMT

@Aidonis

You can try this as well:
flatten-spark-dataframe · PyPI
This also allows for specific level of flattening.