
Flatten Deep Nested Struct

Aidonis
New Contributor III

Hi All,

I have a deeply nested Spark DataFrame with a struct similar to the one below:

|-- id: integer (nullable = true)
|-- lower: struct (nullable = true)
|    |-- field_a: integer (nullable = true)
|    |-- upper: struct (nullable = true)
|    |    |-- field_A: integer (nullable = true)
|    |    |-- num: struct (nullable = true)
|    |    |    |-- field_1: integer (nullable = true)
|    |    |    |-- field_2: string (nullable = true)

I'm looking to flatten this so that I end up with a new structure like this:

|-- id: integer (nullable = true)
|-- lower: struct (nullable = true)
|-- lower.field_a: integer (nullable = true)
|-- lower.upper: struct (nullable = true)
|-- lower.upper.field_A: integer (nullable = true)
|-- lower.upper.num: struct (nullable = true)
|-- lower.upper.num.field_1: integer (nullable = true)
|-- lower.upper.num.field_2: string (nullable = true)

The reason for this change is so I can put this into a nice table where each column is an element of my nested struct. The column names don't matter too much to me.

I know I can use df.select('*', 'lower.*', 'lower.upper.*', 'lower.upper.num.*') to get what I want, however here's the trick...

This struct will change over time, and I am looking for an elegant way to flatten it without referencing specific columns.

Any ideas? Or tips?

Thanks

Aidonis

1 ACCEPTED SOLUTION


Hubert-Dudek
Esteemed Contributor III

You need to use something like:

from pyspark.sql.types import StructType

def flatten(schema, prefix=None):
    # Recursively collect the dotted path to every leaf field in the schema
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, StructType):
            # Descend into the nested struct, carrying the dotted prefix
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields

flat_df = df.select(flatten(df.schema))
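One caveat: selecting 'lower.field_a' produces a column named just field_a, so two leaves with the same name in different branches will collide. If you want the fully qualified dotted names from your target schema, you can alias each leaf with its path. A minimal variation on the sketch above (flatten_cols is just an illustrative name; note that columns whose names contain dots must later be referenced with backticks):

from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def flatten_cols(schema, prefix=None):
    # Like flatten(), but returns Column objects aliased with the full dotted path
    cols = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        if isinstance(field.dataType, StructType):
            cols += flatten_cols(field.dataType, prefix=name)
        else:
            # Alias with the dotted path so sibling leaves never clash
            cols.append(col(name).alias(name))
    return cols

flat_df = df.select(flatten_cols(df.schema))
flat_df.printSchema()  # columns: id, lower.field_a, lower.upper.field_A, ...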


3 REPLIES


daniel_sahal
Esteemed Contributor

@Aidan Heffernan

https://medium.com/@thomaspt748/how-to-flatten-json-files-dynamically-using-apache-pyspark-c6b1b5fd4...

Here you can find a piece of code that flattens JSON dynamically based on the data type (ArrayType or StructType).
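For completeness, the core idea behind approaches like that one is to loop until no complex columns remain: expand each StructType into top-level columns and explode each ArrayType into rows. A minimal sketch of that pattern (not the article's exact code; flatten_df and the name_field aliasing scheme are illustrative choices):

from pyspark.sql.functions import col, explode_outer
from pyspark.sql.types import ArrayType, StructType

def flatten_df(df):
    # Keep expanding until no struct or array columns remain
    while True:
        complex_fields = [f for f in df.schema.fields
                          if isinstance(f.dataType, (StructType, ArrayType))]
        if not complex_fields:
            return df
        field = complex_fields[0]
        if isinstance(field.dataType, StructType):
            # Promote each struct field to a top-level column
            expanded = [col(f"{field.name}.{f.name}").alias(f"{field.name}_{f.name}")
                        for f in field.dataType.fields]
            df = df.select("*", *expanded).drop(field.name)
        else:
            # One output row per array element; explode_outer keeps null/empty arrays
            df = df.withColumn(field.name, explode_outer(col(field.name)))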

Praveen-bpk21
New Contributor II

@Aidonis 

You can try this as well:
flatten-spark-dataframe · PyPI
It also allows flattening to a specific level.
