JSON string object with nested Array and Struct column to dataframe in pyspark

filipjankovic — Mon, 10 Jul 2023 09:06:14 GMT

I am trying to convert a JSON string stored in a variable into a Spark DataFrame without specifying a schema. I have a large number of different tables, so the conversion has to be dynamic. I managed to do it with sc.parallelize, but since we are moving to Unity Catalog I had to create a Shared Compute cluster, and now sc.parallelize and some other libraries no longer work.

I have prepared 3 different JSON strings, each stored in a variable, that look something like this; the original data has many more rows. I need it to work for all 3 examples.

Onedrive file: JSON conversion sample.dbc

Here is example code that works on a Single User cluster but not on Shared Compute:

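import json

# Serialize each record to a JSON string, then let Spark infer the schema.
json_rdd = sc.parallelize(value_json).map(lambda x: json.dumps(x))
data_final_df = spark.read.json(json_rdd)
# Replace '@odata.' and '.' in column names so they are easier to reference.
data_final_df = data_final_df.toDF(*(c.replace('@odata.', '_odata_').replace('.', '_') for c in data_final_df.columns))

display(data_final_df)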

Re: JSON string object with nested Array and Struct column to dataframe in pyspark

cgrant — Thu, 21 Nov 2024 22:46:08 GMT

Hi filipjankovic,

SparkContext (sc) is a Spark 1.0 API and is deprecated on Standard and Serverless compute. However, your input data is a list of dictionaries, which spark.createDataFrame accepts directly.

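This should give you the same data without dropping down to RDDs or using the deprecated SparkContext. Because createDataFrame infers types from the Python objects rather than from JSON text, it is worth checking the resulting schema; for example, nested dictionaries may be inferred as map columns rather than structs, depending on the Spark version and configuration: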

data_df = spark.createDataFrame(value_json)
data_final_df = data_df.toDF(*(c.replace('@odata.', '_odata_').replace('.', '_') for c in data_df.columns))
display(data_final_df)
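
The column renaming in the toDF(...) call can be checked on its own, without a cluster. Below is a minimal sketch in plain Python; the helper names sanitize_column_name and sanitize_columns and the sample column names are hypothetical and only there for illustration, and the two replacements are the same ones used above.

```python
def sanitize_column_name(name: str) -> str:
    """Apply the same two replacements used in the toDF(...) call."""
    return name.replace('@odata.', '_odata_').replace('.', '_')


def sanitize_columns(columns):
    """Return the cleaned column names in their original order."""
    return [sanitize_column_name(c) for c in columns]


# Hypothetical column names, for illustration only.
sample_columns = ['@odata.etag', 'id', 'address.city']
cleaned = sanitize_columns(sample_columns)
print(cleaned)
```

With a helper like this, the rename would read data_df.toDF(*sanitize_columns(data_df.columns)). If two source columns could map to the same cleaned name, it may be worth checking the result for duplicates before calling toDF.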
