<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Dynamically detect if any dataframe column is an array type, to perform logic on that column in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dynamically-detect-if-any-dataframe-column-is-an-array-type-to/m-p/49082#M28471</link>
    <description>&lt;P&gt;That is totally possible.&lt;BR /&gt;For example, here is a function that trims all string columns in a DataFrame.&amp;nbsp; You can adapt it to your needs:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def trim_all_string_columns(df: DataFrame) -&amp;gt; DataFrame:
    # Walk the schema and trim every column whose data type is StringType.
    for c in df.schema.fields:
        if isinstance(c.dataType, StringType):
            df = df.withColumn(c.name, F.trim(F.col(c.name)))
    return df&lt;/LI-CODE&gt;</description>
    <pubDate>Fri, 13 Oct 2023 07:34:33 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2023-10-13T07:34:33Z</dc:date>
    <item>
      <title>Dynamically detect if any dataframe column is an array type, to perform logic on that column</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-detect-if-any-dataframe-column-is-an-array-type-to/m-p/49075#M28467</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'd like to put this out here in case there are some helpful suggestions to be found.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What am I trying to achieve?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Generate a hash of certain columns in a dataframe (a row hash, but not over the whole row), where currently one of the columns is an array of structs, without explicitly referencing the column(s) by name.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Where have I got to?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I have achieved what I want to do... sort of, by specifying the columns and using the to_json() function to convert the array of structs into a string, which enables me to use sha2().&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What's the problem?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I don't want to specify the column (or columns) by name.&amp;nbsp; The data comes from an API in JSON format, and I want to safeguard against changes in schema.&amp;nbsp; If the API payload changes without warning, my aim is for our process to adjust without intervention.&amp;nbsp; So, if the current array-of-nested-objects column changes name, I don't want it to break.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What have I tried?&lt;/STRONG&gt;&lt;BR /&gt;I've tried playing around with dataframe.schema and dataframe.dtypes, but I can't get a simple true/false result for whether a column is an array.&amp;nbsp; The data type is reported as ArrayType, yes, but it is followed by the full nested schema (all the nested columns, etc.).&amp;nbsp; So I haven't got something like&amp;nbsp;&lt;EM&gt;if dataType is array: true else false&lt;/EM&gt;&amp;nbsp;working.&lt;/P&gt;&lt;P&gt;Source format example:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;{
"name":"value",
"Array":
 [
   {
     "Id":1234,
     "Name":"some name",
     ...
   },
   {
    ...
   }
 ]
}&lt;/LI-CODE&gt;&lt;P&gt;Anyone have any ideas?&lt;/P&gt;</description>
      <pubDate>Fri, 13 Oct 2023 03:54:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-detect-if-any-dataframe-column-is-an-array-type-to/m-p/49075#M28467</guid>
      <dc:creator>ilarsen</dc:creator>
      <dc:date>2023-10-13T03:54:58Z</dc:date>
    </item>
    <item>
      <title>Re: Dynamically detect if any dataframe column is an array type, to perform logic on that column</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-detect-if-any-dataframe-column-is-an-array-type-to/m-p/49082#M28471</link>
      <description>&lt;P&gt;That is totally possible.&lt;BR /&gt;For example, here is a function that trims all string columns in a DataFrame.&amp;nbsp; You can adapt it to your needs:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def trim_all_string_columns(df: DataFrame) -&amp;gt; DataFrame:
    # Walk the schema and trim every column whose data type is StringType.
    for c in df.schema.fields:
        if isinstance(c.dataType, StringType):
            df = df.withColumn(c.name, F.trim(F.col(c.name)))
    return df&lt;/LI-CODE&gt;</description>
      <pubDate>Fri, 13 Oct 2023 07:34:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-detect-if-any-dataframe-column-is-an-array-type-to/m-p/49082#M28471</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-10-13T07:34:33Z</dc:date>
    </item>
    <item>
      <title>Re: Dynamically detect if any dataframe column is an array type, to perform logic on that column</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-detect-if-any-dataframe-column-is-an-array-type-to/m-p/49890#M28639</link>
      <description>&lt;P&gt;Thanks for that.&amp;nbsp; The&amp;nbsp;&lt;EM&gt;isinstance&lt;/EM&gt; check is what I was looking for, and it did help me out, although I didn't end up continuing down that track.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Oct 2023 22:31:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-detect-if-any-dataframe-column-is-an-array-type-to/m-p/49890#M28639</guid>
      <dc:creator>ilarsen</dc:creator>
      <dc:date>2023-10-25T22:31:25Z</dc:date>
    </item>
  </channel>
</rss>

