
Flatten Deep Nested Struct

Aidonis
New Contributor III

Hi All,

I have a deeply nested Spark DataFrame with a struct similar to the one below:

|-- id: integer (nullable = true)
|-- lower: struct (nullable = true)
|    |-- field_a: integer (nullable = true)
|    |-- upper: struct (nullable = true)
|    |    |-- field_A: integer (nullable = true)
|    |    |-- num: struct (nullable = true)
|    |    |    |-- field_1: integer (nullable = true)
|    |    |    |-- field_2: string (nullable = true)

I'm looking to flatten this so that I end up with a new structure like this:

|-- id: integer (nullable = true)
|-- lower: struct (nullable = true)
|-- lower.field_a: integer (nullable = true)
|-- lower.upper: struct (nullable = true)
|-- lower.upper.field_A: integer (nullable = true)
|-- lower.upper.num: struct (nullable = true)
|-- lower.upper.num.field_1: integer (nullable = true)
|-- lower.upper.num.field_2: string (nullable = true)

The reason for this change is so I can put this into a nice table where each column is an element of my nested struct. The column names don't matter too much to me.

I know I can use df.select('*', 'lower.*', 'lower.upper.*', 'lower.upper.num.*') to get what I want, however here's the trick...

This struct will change over time, and I am looking for an elegant way to flatten it without referencing specific columns.

Any ideas? Or tips?

Thanks

Aidonis

1 ACCEPTED SOLUTION


Hubert-Dudek
Esteemed Contributor III

You need to use something like:

from pyspark.sql.types import StructType

def flatten(schema, prefix=None):
    # Recursively collect the dotted path to every leaf field in the schema
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, StructType):
            # Descend into the nested struct, carrying the dotted prefix
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields

flat_df = df.select(flatten(df.schema))
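One caveat: selecting 'lower.field_a' produces a column named just field_a, so two leaves with the same name in different branches will collide. If you want the fully qualified dotted names from your target schema, you can alias each leaf with its path. A minimal variation on the sketch above (flatten_cols is just an illustrative name; note that columns whose names contain dots must later be referenced with backticks):

from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def flatten_cols(schema, prefix=None):
    # Like flatten(), but returns Column objects aliased with the full dotted path
    cols = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        if isinstance(field.dataType, StructType):
            cols += flatten_cols(field.dataType, prefix=name)
        else:
            # Alias with the dotted path so sibling leaves never clash
            cols.append(col(name).alias(name))
    return cols

flat_df = df.select(flatten_cols(df.schema))
flat_df.printSchema()  # columns: id, lower.field_a, lower.upper.field_A, ...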


3 REPLIES


daniel_sahal
Esteemed Contributor

@Aidan Heffernan

https://medium.com/@thomaspt748/how-to-flatten-json-files-dynamically-using-apache-pyspark-c6b1b5fd4...

Here you can find a piece of code that flattens JSON dynamically based on the data type (ArrayType or StructType).
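For completeness, the core idea behind approaches like that one is to loop until no complex columns remain: expand each StructType into top-level columns and explode each ArrayType into rows. A minimal sketch of that pattern (not the article's exact code; flatten_df and the name_field aliasing scheme are illustrative choices):

from pyspark.sql.functions import col, explode_outer
from pyspark.sql.types import ArrayType, StructType

def flatten_df(df):
    # Keep expanding until no struct or array columns remain
    while True:
        complex_fields = [f for f in df.schema.fields
                          if isinstance(f.dataType, (StructType, ArrayType))]
        if not complex_fields:
            return df
        field = complex_fields[0]
        if isinstance(field.dataType, StructType):
            # Promote each struct field to a top-level column
            expanded = [col(f"{field.name}.{f.name}").alias(f"{field.name}_{f.name}")
                        for f in field.dataType.fields]
            df = df.select("*", *expanded).drop(field.name)
        else:
            # One output row per array element; explode_outer keeps null/empty arrays
            df = df.withColumn(field.name, explode_outer(col(field.name)))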

Praveen-bpk21
New Contributor II

@Aidonis 

You can try this as well:
flatten-spark-dataframe · PyPI
It also allows flattening to a specific level.
