Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Flatten Deep Nested Struct

Aidonis
New Contributor III

Hi All,

I have a deeply nested Spark DataFrame with a struct schema similar to the one below:

|-- id: integer (nullable = true)
|-- lower: struct (nullable = true)
|    |-- field_a: integer (nullable = true)
|    |-- upper: struct (nullable = true)
|    |    |-- field_A: integer (nullable = true)
|    |    |-- num: struct (nullable = true)
|    |    |    |-- field_1: integer (nullable = true)
|    |    |    |-- field_2: string (nullable = true)
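For a quick repro, a toy DataFrame with this shape can be built like so (the sample values and DDL string are just illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data only; the schema string mirrors the printSchema output above.
df = spark.createDataFrame(
    [(1, (2, (3, (4, "x"))))],
    "id int, lower struct<field_a:int, upper:struct<field_A:int, num:struct<field_1:int, field_2:string>>>",
)
df.printSchema()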

I'm looking to flatten this so that I end up with a new schema like this:

|-- id: integer (nullable = true)
|-- lower: struct (nullable = true)
|-- lower.field_a: integer (nullable = true)
|-- lower.upper: struct (nullable = true)
|-- lower.upper.field_A: integer (nullable = true)
|-- lower.upper.num: struct (nullable = true)
|-- lower.upper.num.field_1: integer (nullable = true)
|-- lower.upper.num.field_2: string (nullable = true)

The reason for this change is so I can put the data into a nice table where each column is an element of my nested struct. The column names don't matter too much to me.

I know I can use df.select('*', 'lower.*', 'lower.upper.*', 'lower.upper.num.*') to get what I want, but here's the trick: this struct will change over time, and I'm looking for an elegant way to flatten it without referencing specific columns.

Any ideas? Or tips?

Thanks

Aidonis

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

You need to use something like:

from pyspark.sql.types import StructType

def flatten(schema, prefix=None):
    """Return the dotted path of every leaf field in a (possibly nested) schema."""
    fields = []
    for field in schema.fields:
        # Build the dotted path to this field, e.g. 'lower.upper.field_A'.
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, StructType):
            # Recurse into nested structs and collect their leaves.
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)

    return fields

df.select(flatten(df.schema))
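Note that selecting dotted paths directly keeps only the leaf name as the column name, so leaves that repeat across branches (e.g. a field_1 in two different structs) would collide. A minimal way around that, assuming no field names themselves contain dots, is to alias each path when selecting:

from pyspark.sql.functions import col

# Keep the full path in each column name, e.g. 'lower_upper_field_A'.
flat_df = df.select([col(name).alias(name.replace('.', '_')) for name in flatten(df.schema)])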


3 REPLIES


daniel_sahal
Esteemed Contributor

@Aidan Heffernan

https://medium.com/@thomaspt748/how-to-flatten-json-files-dynamically-using-apache-pyspark-c6b1b5fd4...

Here you can find a piece of code that flattens JSON dynamically based on the data type (Array or Struct).
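For completeness, here is a minimal sketch in the same spirit (not the article's exact code): it expands structs into top-level columns and explodes arrays into one row per element, looping until neither remains. The helper name flatten_df is made up, and it assumes field names are plain identifiers (no dots) and doesn't handle map types:

from pyspark.sql.types import ArrayType, StructType
from pyspark.sql.functions import col, explode_outer

def flatten_df(df):
    # Repeatedly expand the first struct or array column found until none remain.
    while True:
        expanded = False
        for field in df.schema.fields:
            if isinstance(field.dataType, StructType):
                # Promote each struct member to a top-level column, keeping the path in the name.
                others = [c for c in df.columns if c != field.name]
                nested = [
                    col(field.name + '.' + f.name).alias(field.name + '_' + f.name)
                    for f in field.dataType.fields
                ]
                df = df.select(*others, *nested)
                expanded = True
                break
            if isinstance(field.dataType, ArrayType):
                # One output row per array element; explode_outer keeps rows
                # whose array is null or empty (as a null element).
                df = df.withColumn(field.name, explode_outer(col(field.name)))
                expanded = True
                break
        if not expanded:
            return df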

Praveen-bpk21
New Contributor II

@Aidonis 

You can try this as well: flatten-spark-dataframe · PyPI. It also allows flattening to a specific level.
