Databricks Community

filipjankovic · 07-10-2023

I am trying to convert a JSON string stored in a variable into a Spark DataFrame without specifying a schema. I have a large number of different tables, so the conversion has to be dynamic. I managed to do it with sc.parallelize, but since we are moving to Unity Catalog, I had to create a Shared Compute cluster, and now sc.parallelize and some other libraries no longer work.

I have prepared 3 different JSON strings, each stored in a variable, that look something like the attached samples; the real data has many more rows. I need the solution to work for all 3 examples.

Onedrive file: JSON conversion sample.dbc

Here is an example of code that works on a Single User cluster, but not on Shared Compute:

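import json

# value_json is a list of dicts; serialize each one to a JSON string inside an RDD
data_df = sc.parallelize(value_json).map(lambda x: json.dumps(x))
# Let Spark infer the schema from the JSON strings
data_final_df = spark.read.json(data_df)
# Replace characters that are awkward in column names
data_final_df = data_final_df.toDF(*(c.replace('@odata.', '_odata_').replace('.', '_') for c in data_final_df.columns))

display(data_final_df)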

cgrant · 11-21-2024

Hi filipjankovic,

SparkContext (sc) is a Spark 1.x-era API and is deprecated on Standard and Serverless compute, which is why sc.parallelize fails there. However, your input data is a list of dictionaries, which spark.createDataFrame accepts directly.

This should give you identical output without dropping down to RDD or using the deprecated SparkContext:

data_df = spark.createDataFrame(value_json)
data_final_df = data_df.toDF(*(c.replace('@odata.', '_odata_').replace('.', '_') for c in data_df.columns))
display(data_final_df)
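
If value_json sometimes arrives as a raw JSON string or as a single object rather than a list of dictionaries, a small wrapper along these lines may help. This is only a sketch: normalize_records, sanitize_column, and records_to_dataframe are hypothetical helper names, not part of PySpark, and spark is assumed to be the session Databricks provides in a notebook. One caveat worth checking against your 3 samples: as far as I know, spark.createDataFrame infers nested dictionaries as map columns by default rather than structs, so deeply nested data may not match the spark.read.json result exactly.

```python
import json


def normalize_records(value_json):
    """Return a list of dicts from a JSON string, a single dict, or a list of dicts."""
    # Hypothetical helper: parse the payload only if it is still a string.
    data = json.loads(value_json) if isinstance(value_json, str) else value_json
    # Wrap a single object so the result is always a list of rows.
    return [data] if isinstance(data, dict) else list(data)


def sanitize_column(name):
    """Apply the same column renaming used in the thread."""
    return name.replace('@odata.', '_odata_').replace('.', '_')


def records_to_dataframe(spark, value_json):
    """Build a DataFrame without touching SparkContext or RDDs."""
    # spark is assumed to be an existing SparkSession (provided by the notebook).
    df = spark.createDataFrame(normalize_records(value_json))
    return df.toDF(*(sanitize_column(c) for c in df.columns))


# The pure-Python parts can be checked without a cluster:
sample = '[{"@odata.etag": "abc", "address.city": "Oslo", "id": 1}]'
records = normalize_records(sample)
renamed = [sanitize_column(c) for c in records[0]]
print(renamed)
```

In a notebook this would be called as display(records_to_dataframe(spark, value_json)).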
