<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Nested struct type not supported pyspark error in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/nested-struct-type-not-supported-pyspark-error/m-p/4946#M1509</link>
    <description>&lt;P&gt;I am trying to apply a function to a PySpark DataFrame, save the API response to a new column, and then parse it using `json_normalize`. This works fine in pandas; however, I run into an exception with `pyspark`.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pyspark.pandas as ps
import pandas as pd
import requests

def get_vals(row):
    # make api call (placeholder return value shown here)
    return row['A'] * row['B']

# Create a pandas DataFrame
pdf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Apply function - get API responses
pdf['api_response'] = pdf.apply(lambda row: get_vals(row), axis=1)
pdf.head(5)

# Unpack JSON API response
try:
    dff = pd.json_normalize(pdf['api_response'].str['location'])
except TypeError as e:
    print(f"Error: {e}")
    print(f"Problematic data: {pdf['api_response']}")

# To pandas-on-Spark DataFrame
psdf = ps.DataFrame(pdf)
psdf.head(5)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The expected output is a JSON-normalized DataFrame. When I apply the function over a `pyspark` DataFrame, it throws an exception:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;psdf['api_response'] = psdf.apply(lambda row: get_vals(row), axis=1)

---------------------------------------------------------------------------
TypeError                 Traceback (most recent call last)
File &amp;lt;command-4372401754138893&amp;gt;:2
     1
----&amp;gt; 2 psdf['api_response'] = psdf.apply(lambda row: get_vals(row), axis=1)

TypeError: Nested StructType not supported in conversion from Arrow: struct&amp;lt;data: struct&amp;lt;geographies: &lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Tue, 02 May 2023 03:21:36 GMT</pubDate>
    <dc:creator>kll</dc:creator>
    <dc:date>2023-05-02T03:21:36Z</dc:date>
    <item>
      <title>Nested struct type not supported pyspark error</title>
      <link>https://community.databricks.com/t5/data-engineering/nested-struct-type-not-supported-pyspark-error/m-p/4946#M1509</link>
      <description>&lt;P&gt;I am trying to apply a function to a PySpark DataFrame, save the API response to a new column, and then parse it using `json_normalize`. This works fine in pandas; however, I run into an exception with `pyspark`.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pyspark.pandas as ps
import pandas as pd
import requests

def get_vals(row):
    # make api call (placeholder return value shown here)
    return row['A'] * row['B']

# Create a pandas DataFrame
pdf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Apply function - get API responses
pdf['api_response'] = pdf.apply(lambda row: get_vals(row), axis=1)
pdf.head(5)

# Unpack JSON API response
try:
    dff = pd.json_normalize(pdf['api_response'].str['location'])
except TypeError as e:
    print(f"Error: {e}")
    print(f"Problematic data: {pdf['api_response']}")

# To pandas-on-Spark DataFrame
psdf = ps.DataFrame(pdf)
psdf.head(5)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The expected output is a JSON-normalized DataFrame. When I apply the function over a `pyspark` DataFrame, it throws an exception:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;psdf['api_response'] = psdf.apply(lambda row: get_vals(row), axis=1)

---------------------------------------------------------------------------
TypeError                 Traceback (most recent call last)
File &amp;lt;command-4372401754138893&amp;gt;:2
     1
----&amp;gt; 2 psdf['api_response'] = psdf.apply(lambda row: get_vals(row), axis=1)

TypeError: Nested StructType not supported in conversion from Arrow: struct&amp;lt;data: struct&amp;lt;geographies: &lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 02 May 2023 03:21:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/nested-struct-type-not-supported-pyspark-error/m-p/4946#M1509</guid>
      <dc:creator>kll</dc:creator>
      <dc:date>2023-05-02T03:21:36Z</dc:date>
    </item>
    <item>
      <title>Re: Nested struct type not supported pyspark error</title>
      <link>https://community.databricks.com/t5/data-engineering/nested-struct-type-not-supported-pyspark-error/m-p/4947#M1510</link>
      <description>&lt;P&gt;@Keval Shah&amp;nbsp;:&lt;/P&gt;&lt;P&gt;The error message indicates that the problem is the schema of the api_response column in psdf: the API response is a nested structure, which ends up as a nested struct type.&lt;/P&gt;&lt;P&gt;In the PySpark version that raises this error, nested struct types are not supported when converting from Arrow, the format PySpark uses to exchange data with pandas. As a result, you cannot directly apply a function that returns a nested struct to a pandas-on-Spark DataFrame.&lt;/P&gt;&lt;P&gt;One workaround is to use the pandas_udf function in PySpark, which lets you apply a pandas UDF (user-defined function) to a Spark DataFrame. You can define a pandas UDF that takes a pandas DataFrame as input, applies the function to it, and returns a new, flat pandas DataFrame with the normalized JSON data. Here is an example of how you can modify your code to use pandas_udf. The latitude and longitude field names are assumptions about your API response, so adjust them to the keys your API actually returns:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pandas as pd
import pyspark.pandas as ps
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, DoubleType

# Define a flat (non-nested) schema for the UDF output; field names are assumed
output_schema = StructType([
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType())
])

# Define the pandas UDF that applies the API call and normalizes the JSON response.
# The struct input column arrives as a pandas DataFrame, one batch at a time.
@F.pandas_udf(returnType=output_schema)
def api_call_udf(pdf: pd.DataFrame) -&amp;gt; pd.DataFrame:
    # Apply the API call row by row; get_vals must return the parsed JSON as a dict
    responses = pdf.apply(get_vals, axis=1)
    # Normalize the nested "location" object into flat columns
    normalized = pd.json_normalize([r["location"] for r in responses])
    # Keep only the columns declared in the schema, in schema order
    return normalized.reindex(columns=["latitude", "longitude"]).astype("float64")

# Convert the pandas-on-Spark DataFrame to a Spark DataFrame
sdf = psdf.to_spark()

# Apply the UDF to a struct of all input columns, then expand the result into top-level columns
normalized_df = sdf.withColumn("output", api_call_udf(F.struct(*[F.col(c) for c in sdf.columns])))
normalized_df = normalized_df.select(*sdf.columns, "output.*")

# Convert the output Spark DataFrame back to a pandas-on-Spark DataFrame
psdf_output = ps.DataFrame(normalized_df)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;In this code, we define a flat schema for the output and a pandas UDF that applies the API call to each batch, normalizes the JSON data using pandas.json_normalize, and returns a flat pandas DataFrame. We then convert the pandas-on-Spark DataFrame to a Spark DataFrame, apply the UDF, expand the resulting struct into top-level columns, and convert the result back to a pandas-on-Spark DataFrame.&lt;/P&gt;</description>
      <pubDate>Sat, 13 May 2023 15:49:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/nested-struct-type-not-supported-pyspark-error/m-p/4947#M1510</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-05-13T15:49:09Z</dc:date>
    </item>
    <item>
      <title>Re: Nested struct type not supported pyspark error</title>
      <link>https://community.databricks.com/t5/data-engineering/nested-struct-type-not-supported-pyspark-error/m-p/4948#M1511</link>
      <description>&lt;P&gt;Hi @Keval Shah&lt;/P&gt;&lt;P&gt;Thank you for posting your question in our community! We are happy to assist you.&lt;/P&gt;&lt;P&gt;To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?&lt;/P&gt;&lt;P&gt;This will also help other community members who may have similar questions in the future. Thank you for your participation, and let us know if you need any further assistance!&lt;/P&gt;</description>
      <pubDate>Fri, 19 May 2023 06:25:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/nested-struct-type-not-supported-pyspark-error/m-p/4948#M1511</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-05-19T06:25:47Z</dc:date>
    </item>
  </channel>
</rss>

