cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

JSON string object with nested Array and Struct column to dataframe in pyspark

filipjankovic
New Contributor

I am trying to convert JSON string stored in variable into spark dataframe without specifying schema, because I have a big number of different tables, so it has to be dynamically. I managed to do it with sc.parallelize, but since we are moving to Unity Catalog, I had to create a Shared Compute cluster, so now sc.parallelize and some other libraries are not working.

I have prepared 3 different JSON strings stored in variable that looks something like this, but originally it has much more rows. I need it to work for all 3 examples.

Onedrive file: JSON conversion sample.dbc

Here is the example of code that is working with Single user cluster, but not with Shared Compute:

import json

data_df = sc.parallelize(value_json).map(lambda x: json.dumps(x))
data_final_df = spark.read.json(data_df)
data_final_df = data_final_df.toDF(*(c.replace('@odata.', '_odata_').replace('.', '_') for c in data_final_df.columns))

display(data_final_df)

 

 
 

 

1 REPLY 1

Kaniz_Fatma
Community Manager
Community Manager

Hi @filipjankovic, Since you have multiple tables and need dynamic schema inference, I recommend using the following approach:

  1. Schema Inference from JSON String: You can infer the schema from the JSON string and then create a DataFrame.

  2. Schema Inference from First Row: If you want to infer the schema from the first row of your JSON data, you can follow this approach:

    ## Assuming you have a DataFrame with a column 'temp_json_string
    # Get the first row's JSON string
    val json = df.select("temp_json_string").collect()(0).getString(0)
    
    # Infer schema from the JSON string
    val ddlSchema: String = spark.sql(s"select schema_of_json('${json}')").collect()(0).getString(0)
    
    # Create DataFrame with inferred schema
    val data_final_df = spark.read.schema(ddlSchema).json(ds)
    

Remember to replace df with your actual DataFrame containing the JSON strings. These approaches should work regardless of whether you’re using a Single User cluster or a Shared Compute cluster. Let me know if you need further assistance!

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!