
Flatten Deep Nested Struct

Aidonis
New Contributor III

Hi All,

I have a deeply nested Spark DataFrame schema, something similar to the below:

|-- id: integer (nullable = true)
|-- lower: struct (nullable = true)
| |-- field_a: integer (nullable = true)
| |-- upper: struct (nullable = true)
| | |-- field_A: integer (nullable = true)
| | |-- num: struct (nullable = true)
| | | |-- field_1: integer (nullable = true)
| | | |-- field_2: string (nullable = true)

I'm looking to flatten this so that I have a new schema like this:

|-- id: integer (nullable = true)
|-- lower: struct (nullable = true)
|-- lower.field_a: integer (nullable = true)
|-- lower.upper: struct (nullable = true)
|-- lower.upper.field_A: integer (nullable = true)
|-- lower.upper.num: struct (nullable = true)
|-- lower.upper.num.field_1: integer (nullable = true)
|-- lower.upper.num.field_2: string (nullable = true)

The reason for this change is so that I can put this into a nice table where each column is an element of my nested struct. The column names don't matter too much to me.

I know I can use df.select('*', 'lower.*', 'lower.upper.*', 'lower.upper.num.*') to get what I want; however, here's the trick...

This struct will change over time, and I am looking for an elegant way to flatten it without referencing specific columns.

Any ideas? Or tips?

Thanks

Aidonis

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

You need to use something like:

from pyspark.sql.types import StructType

def flatten(schema, prefix=None):
    # Walk the schema recursively, building dotted paths to every leaf field.
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, StructType):
            # Nested struct: recurse, carrying the dotted prefix along.
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields

# select() accepts the dotted paths as strings, so no column is hard-coded.
df_flat = df.select(flatten(df.schema))

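One caveat worth noting (an observation added here, not part of the original reply): selecting the dotted paths names each output column after its leaf field alone, so lower.field_a and lower.upper.field_A come out as field_a and field_A, which Spark's default case-insensitive resolution treats as ambiguous. A small variant that keeps the full path in the column name, assuming the flatten helper above:

from pyspark.sql.functions import col

# Alias each leaf with its full dotted path, using underscores since
# dots in column names would otherwise need backtick quoting.
df_flat = df.select([col(name).alias(name.replace('.', '_'))
                     for name in flatten(df.schema)])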

3 REPLIES


daniel_sahal
Esteemed Contributor

@Aidan Heffernan

https://medium.com/@thomaspt748/how-to-flatten-json-files-dynamically-using-apache-pyspark-c6b1b5fd4...

Here you can find a piece of code that flattens JSON dynamically based on the data type (array or struct); the sketch below shows the idea.
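The gist (a minimal sketch of the approach, not the article's exact code): struct fields can be selected straight through, while array fields first need explode_outer() to turn elements into rows. The helper name flatten_all and its loop structure are assumptions for illustration:

from pyspark.sql.types import StructType, ArrayType
from pyspark.sql.functions import col, explode_outer

def flatten_all(df):
    # Keep rewriting the DataFrame until no struct or array columns remain.
    changed = True
    while changed:
        changed = False
        for field in df.schema.fields:
            if isinstance(field.dataType, StructType):
                # Promote each struct member to a top-level column.
                others = [c for c in df.columns if c != field.name]
                nested = [col(field.name + '.' + f.name).alias(field.name + '_' + f.name)
                          for f in field.dataType.fields]
                df = df.select(*others, *nested)
                changed = True
                break
            if isinstance(field.dataType, ArrayType):
                # One row per array element; explode_outer keeps rows with empty arrays.
                df = df.withColumn(field.name, explode_outer(field.name))
                changed = True
                break
    return df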

Praveen-bpk21
New Contributor II

@Aidonis 

You can try this as well:
flatten-spark-dataframe · PyPI
This also allows flattening to a specific level; a depth-limited sketch follows below.
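If you'd rather avoid a dependency, the same level-specific behaviour can be sketched by adding a depth limit to the flatten helper from the accepted solution (flatten_to_depth and max_depth are names introduced here for illustration, not the package's API):

from pyspark.sql.types import StructType

def flatten_to_depth(schema, max_depth, prefix=None, depth=0):
    # Like flatten(), but stops recursing once max_depth levels are unpacked.
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        if isinstance(field.dataType, StructType) and depth < max_depth:
            fields += flatten_to_depth(field.dataType, max_depth, prefix=name, depth=depth + 1)
        else:
            fields.append(name)
    return fields

# Unpack only the first level of nesting; deeper structs stay intact.
df_one_level = df.select(flatten_to_depth(df.schema, max_depth=1))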
