<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic JSON string object with nested Array and Struct column to dataframe in pyspark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/json-string-object-with-nested-array-and-struct-column-to/m-p/37283#M26312</link>
    <description>&lt;P&gt;I am trying to convert a JSON string stored in a variable into a Spark DataFrame without specifying a schema. I have a large number of different tables, so the conversion has to be dynamic. I managed to do it with sc.parallelize, but since we are moving to Unity Catalog I had to create a Shared Compute cluster, and now sc.parallelize and some other libraries no longer work.&lt;/P&gt;&lt;P&gt;I have prepared 3 different JSON strings, each stored in a variable, that look something like this, although the originals have many more rows. I need a solution that works for all 3 examples.&lt;/P&gt;&lt;P&gt;OneDrive file:&amp;nbsp;&lt;A href="https://callistadigital-my.sharepoint.com/:u:/g/personal/filip_jankovic_digital_callista_ch/Ef1YTRlqhspFtJs0FXfYBo0B-35VpEyNKUFYoYfCip8zMg?e=sGDNUp" target="_blank"&gt;JSON conversion sample.dbc&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Here is example code that works on a Single user cluster but not on Shared Compute:&lt;/P&gt;&lt;PRE&gt;import json

# Serialize each record to a JSON string and distribute the strings as an RDD
data_df = sc.parallelize(value_json).map(lambda x: json.dumps(x))
# Let Spark infer the schema from the JSON strings
data_final_df = spark.read.json(data_df)
# Replace the '@odata.' marker and any remaining dots in column names
data_final_df = data_final_df.toDF(*(c.replace('@odata.', '_odata_').replace('.', '_') for c in data_final_df.columns))

display(data_final_df)&lt;/PRE&gt;</description>
    <pubDate>Mon, 10 Jul 2023 09:06:14 GMT</pubDate>
    <dc:creator>filipjankovic</dc:creator>
    <dc:date>2023-07-10T09:06:14Z</dc:date>
    <item>
      <title>JSON string object with nested Array and Struct column to dataframe in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/json-string-object-with-nested-array-and-struct-column-to/m-p/37283#M26312</link>
      <description>&lt;P&gt;I am trying to convert a JSON string stored in a variable into a Spark DataFrame without specifying a schema. I have a large number of different tables, so the conversion has to be dynamic. I managed to do it with sc.parallelize, but since we are moving to Unity Catalog I had to create a Shared Compute cluster, and now sc.parallelize and some other libraries no longer work.&lt;/P&gt;&lt;P&gt;I have prepared 3 different JSON strings, each stored in a variable, that look something like this, although the originals have many more rows. I need a solution that works for all 3 examples.&lt;/P&gt;&lt;P&gt;OneDrive file:&amp;nbsp;&lt;A href="https://callistadigital-my.sharepoint.com/:u:/g/personal/filip_jankovic_digital_callista_ch/Ef1YTRlqhspFtJs0FXfYBo0B-35VpEyNKUFYoYfCip8zMg?e=sGDNUp" target="_blank"&gt;JSON conversion sample.dbc&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Here is example code that works on a Single user cluster but not on Shared Compute:&lt;/P&gt;&lt;PRE&gt;import json

# Serialize each record to a JSON string and distribute the strings as an RDD
data_df = sc.parallelize(value_json).map(lambda x: json.dumps(x))
# Let Spark infer the schema from the JSON strings
data_final_df = spark.read.json(data_df)
# Replace the '@odata.' marker and any remaining dots in column names
data_final_df = data_final_df.toDF(*(c.replace('@odata.', '_odata_').replace('.', '_') for c in data_final_df.columns))

display(data_final_df)&lt;/PRE&gt;</description>
      <pubDate>Mon, 10 Jul 2023 09:06:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-string-object-with-nested-array-and-struct-column-to/m-p/37283#M26312</guid>
      <dc:creator>filipjankovic</dc:creator>
      <dc:date>2023-07-10T09:06:14Z</dc:date>
    </item>
    <item>
      <title>Re: JSON string object with nested Array and Struct column to dataframe in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/json-string-object-with-nested-array-and-struct-column-to/m-p/99712#M40067</link>
      <description>&lt;P data-unlink="true"&gt;Hi&amp;nbsp;&lt;SPAN class=""&gt;filipjankovic,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P data-unlink="true"&gt;SparkContext (&lt;EM&gt;sc&lt;/EM&gt;) is a Spark 1.0 API and is deprecated on Standard (formerly called Shared) and Serverless compute. However, your input data is a list of dictionaries, which spark.createDataFrame accepts directly.&lt;/P&gt;
&lt;P data-unlink="true"&gt;This should give you identical output without dropping down to RDDs or using the deprecated SparkContext:&lt;/P&gt;
&lt;DIV&gt;
&lt;DIV&gt;
&lt;PRE&gt;data_df = spark.createDataFrame(value_json)
data_final_df = data_df.toDF(*(c.replace('@odata.', '_odata_').replace('.', '_') for c in data_df.columns))
display(data_final_df)&lt;/PRE&gt;
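&lt;P&gt;To see what the column renaming in the snippet above does on its own, here is a minimal sketch in plain Python that needs no cluster. The helper names sanitize_column_name and sanitize_columns and the sample column names are hypothetical and used only for illustration; the replacement rules are the ones from the code in this thread.&lt;/P&gt;

```python
def sanitize_column_name(name):
    """Apply the same two replacements used in the thread's toDF call."""
    # First turn the '@odata.' marker into '_odata_', then replace any remaining dots.
    return name.replace('@odata.', '_odata_').replace('.', '_')


def sanitize_columns(columns):
    """Return the sanitized names in the same order as the input."""
    return [sanitize_column_name(c) for c in columns]


# Hypothetical column names, standing in for data_df.columns
sample_columns = ['@odata.etag', 'id', 'address.city']
renamed = sanitize_columns(sample_columns)
print(renamed)
```

&lt;P&gt;With a real DataFrame, the same list could be passed on as data_df.toDF(*sanitize_columns(data_df.columns)), which matches the pattern used above.&lt;/P&gt;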
&lt;/DIV&gt;
&lt;/DIV&gt;</description>
      <pubDate>Thu, 21 Nov 2024 22:46:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-string-object-with-nested-array-and-struct-column-to/m-p/99712#M40067</guid>
      <dc:creator>cgrant</dc:creator>
      <dc:date>2024-11-21T22:46:08Z</dc:date>
    </item>
  </channel>
</rss>

