
How to flatten non-standard JSON files in a DataFrame

steelman
New Contributor III

hello,

I have a non-standard JSON file with a nested structure that I am having issues with. Here is an example of the JSON file.

jsonfile= """[ {
      "success":true,
      "numRows":2,
      "data":{
         "58251":{
            "invoiceno":"58251",
            "name":"invoice1",
            "companyId":"1000",
            "departmentId":"1",
            "lines":{
               "40000":{
                  "invoiceline":"40000",
                  "productid":"1",
                  "amount":"10000",
                  "quantity":"7"
               },
               "40001":{
                  "invoiceline":"40001",
                  "productid":"2",
                  "amount":"9000",
                  "quantity":"7"
               }
            }
         },
         "58252":{
            "invoiceno":"58252",
            "name":"invoice34",
            "companyId":"1001",
            "departmentId":"2",
            "lines":{
               "40002":{
                  "invoiceline":"40002",
                  "productid":"3",
                  "amount":"7000",
                  "quantity":"6"
               },
               "40003":{
                  "invoiceline":"40003",
                  "productid":"2",
                  "amount":"9000",
                  "quantity":"7"
               }
            }
         }
      }
   }
]"""
import pandas as pd
df = pd.read_json(jsonfile)
display(df)

The problem here is that some of the keys are used as IDs, e.g. "58252" instead of "id":"58252" or a blank value. This is the case for both the "data" object and the "lines" object. I have tried the standard approaches, Spark's explode and pandas' json_normalize, but they don't seem to work with this particular JSON format.

The endgoal is to get a dataframe looking like this:

[Image: desired format in the DataFrame after processing the JSON file]

I hope you can give me some advice on how to approach and handle these files in Databricks.

1 ACCEPTED SOLUTION

Accepted Solutions

Thanks for following up.

I didn't get any answers from this forum which I could use directly, but it did help me to move forward.

I figured out a solution to the problem after more study of the subject.

The best solution was to define a schema before importing the JSON file. It took some time to write the schema and get it into the correct format, e.g.:

StructField("data",
    MapType(StringType(), StructType([
        StructField("invoiceId", LongType(), True),

With this schema in place, Spark understood that the numbered keys on the node below data form a map of structs, and handled them correctly.
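For reference, a minimal sketch of what a complete schema for the sample file could look like. The field names here follow the sample JSON at the top of the thread (the excerpt above uses slightly different names and types, e.g. invoiceId as LongType), and StringType for the quoted values is an assumption, since the post only shows a fragment of the real schema:

from pyspark.sql.types import (StructType, StructField, MapType,
                               StringType, BooleanType, LongType)

# One line item; all values are quoted strings in the sample file
line_schema = StructType([
    StructField("invoiceline", StringType(), True),
    StructField("productid", StringType(), True),
    StructField("amount", StringType(), True),
    StructField("quantity", StringType(), True),
])

# One invoice; "lines" is a map keyed by the invoice-line number
invoice_schema = StructType([
    StructField("invoiceno", StringType(), True),
    StructField("name", StringType(), True),
    StructField("companyId", StringType(), True),
    StructField("departmentId", StringType(), True),
    StructField("lines", MapType(StringType(), line_schema), True),
])

# Top level; "data" is a map keyed by the invoice number
schema = StructType([
    StructField("success", BooleanType(), True),
    StructField("numRows", LongType(), True),
    StructField("data", MapType(StringType(), invoice_schema), True),
])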

Once the schema was correctly defined for the JSON file, it was possible to use Spark to run explode operations on the "data" map.

Ex:

from pyspark.sql.functions import explode, col

# explode turns the "data" map into one row per invoice (key, value)
df1 = dfresult.select(explode('data'))

# promote the struct fields of each invoice to top-level columns
df2 = df1.select("value.*")

In the end I got all the data into a normalized table.
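Putting the pieces together, a hedged end-to-end sketch, assuming the schema variable sketched above and the jsonfile string from the question; the second explode carries the invoice columns along, so the link between invoice and line survives the flattening:

from pyspark.sql.functions import explode

dfresult = spark.read.schema(schema).json(sc.parallelize([jsonfile]))

# explode the "data" map: one row per invoice
invoices = dfresult.select(explode("data").alias("k", "inv")).select("inv.*")

# explode the "lines" map, keeping the invoice columns alongside each line
lines = (invoices
         .select("invoiceno", "name", "companyId", "departmentId",
                 explode("lines").alias("lk", "line"))
         .select("invoiceno", "name", "companyId", "departmentId", "line.*"))

display(lines)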

Thanks for all the contributions and efforts to help.


6 REPLIES

Hubert-Dudek
Esteemed Contributor III

I am including example code, as well as the notebook source and a screenshot.

# parallelize the JSON string and let Spark infer the schema
rddjson = sc.parallelize([jsonfile])

df = (spark.read
      .json(rddjson)
      .select("data.*"))  # one column per inferred invoice key

# each row holds the invoice structs; flatMap them into one row per invoice
invoices = spark.createDataFrame(df.rdd.flatMap(lambda x: [y for y in x]))

[screenshot of the resulting invoices DataFrame]

Hubert-Dudek
Esteemed Contributor III

So in the next step you can do something similar for the lines.
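A possible sketch of that next step, reusing the same flatMap pattern on the lines struct (assuming the invoices frame from the previous snippet); with inferred schemas the lines struct gets one field per line number seen anywhere in the file, so the None slots are filtered out:

dfl = invoices.select("lines.*")

# drop the None slots that schema inference creates for line numbers
# belonging to other invoices
lines = spark.createDataFrame(
    dfl.rdd.flatMap(lambda x: [y for y in x if y is not None]))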

steelman
New Contributor III

@Hubert Dudek,

Thanks a lot for your answer. That helped. I still wonder how I can keep the connected data between the "data" level and the "lines" level.

# I would like to keep the reference to invoiceno in lines.*
dfl = invoices.select("lines.*", "invoiceno")

lines = spark.createDataFrame(dfl.rdd.flatMap(lambda x: [y for y in x]))

# returns the error message below
display(lines)

I enclose the notebook/DBC file with comments and questions as well.
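For what it is worth, one possible workaround at this point (a sketch assuming the inferred invoices frame from the earlier snippet) is to emit (invoiceno, line) pairs from each invoice row, so every flattened line keeps its parent invoice:

from pyspark.sql import Row

# pair every non-null line struct with its parent invoiceno
pairs = invoices.rdd.flatMap(
    lambda inv: [Row(invoiceno=inv["invoiceno"], **line.asDict())
                 for line in inv["lines"] if line is not None])

lines = spark.createDataFrame(pairs)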

Deepak_Bhutada
Contributor III

@stale stokkereit

You can use the function below to flatten the struct fields:

import pyspark.sql.functions as F

def flatten_df(nested_df):
    # split columns into plain ones and struct-typed ones
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']

    # keep the plain columns and promote each struct field to a
    # top-level column named <struct>_<field>
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    return flat_df
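A hypothetical usage example, assuming df is the frame read from the JSON; the function flattens one level of struct per call, so a deeply nested frame can be passed through it repeatedly:

flat = flatten_df(df)

# repeat until no struct columns remain
while any(t.startswith('struct') for _, t in flat.dtypes):
    flat = flatten_df(flat)

flat.printSchema()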

Your dataframe would look like this:

[screenshot of the flattened DataFrame]

Hi @stale stokkereit,

Just a friendly follow-up. Do you still need help, or did any of the responses help you? Please let us know.

