Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Dynamically detect if any dataframe column is an array type, to perform logic on that column

ilarsen
Contributor

Hi,

 

I'd like to put this out here in case there are some helpful suggestions to be found.

 

What am I trying to achieve?

Generate a hash of certain columns in a dataframe (a row hash, but not of the whole row), where one of the columns is currently an array of structs, without explicitly referencing the column(s) by name.

 

Where have I got to?

I have achieved what I want to do... sort of.  By specifying the columns and using to_json() to convert the array of structs into a string, enabling me to use sha2().

 

What's the problem?

I don't want to specify the column (or columns) by name.  The data comes from an API in JSON format, and I want to safeguard against changes in schema.  If the API payload changes without warning, my aim is for our process to adjust without intervention.  So, if the current array-of-nested-objects column changes name, I don't want it to break.

 

What have I tried?
I've tried playing around with dataframe.schema and dataframe.dtypes, but I can't get a simple true/false check for whether a column is an array.  The data type does show as ArrayType, but listing the schema returns the full nested structure (all the nested columns and so on), so I haven't managed to get something like "if dataType is array: true, else false" working.

 

Source format example:

{
  "name": "value",
  "Array": [
    {
      "Id": 1234,
      "Name": "some name",
      ...
    },
    {
      ...
    }
  ]
}

Anyone have any ideas?

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

That is totally possible.
For example, here is a function that trims all string columns in a dataframe.  You can adapt it to your needs:

from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import StringType

def trim_all_string_columns(df: DataFrame) -> DataFrame:
    for c in df.schema.fields:
        if isinstance(c.dataType, StringType):
            df = df.withColumn(c.name, F.trim(F.col(c.name)))
    return df


2 REPLIES 2

ilarsen
Contributor

Thanks for that.  The isinstance check is what I was looking for and it did help me out, although I didn't end up continuing down that track.
