<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: how to flatten non-standard JSON files in a dataframe in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23489#M16214</link>
    <description>&lt;P&gt;@Hubert Dudek, thanks a lot for your answer. That helped. I still wonder how I can keep the data at the "data" level connected to the "lines" level.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# I would like to keep a reference to invoiceno in lines.*
dfl = invoices.select("lines.*", "invoiceno")

lines = spark.createDataFrame(dfl.rdd.flatMap(lambda x: [y for y in x]))

# returns the error message below
display(lines)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I enclose the notebook/DBC file with comments and questions as well.&lt;/P&gt;</description>
    <pubDate>Thu, 07 Apr 2022 18:18:34 GMT</pubDate>
    <dc:creator>steelman</dc:creator>
    <dc:date>2022-04-07T18:18:34Z</dc:date>
    <item>
      <title>how to flatten non-standard JSON files in a dataframe</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23486#M16211</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have a non-standard JSON file with a nested structure that I am having issues with. Here is an example of the JSON file:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;jsonfile= """[ {
      "success":true,
      "numRows":2,
      "data":{
         "58251":{
            "invoiceno":"58251",
            "name":"invoice1",
            "companyId":"1000",
            "departmentId":"1",
            "lines":{
               "40000":{
                  "invoiceline":"40000",
                  "productid":"1",
                  "amount":"10000",
                  "quantity":"7"
               },
               "40001":{
                  "invoiceline":"40001",
                  "productid":"2",
                  "amount":"9000",
                  "quantity":"7"
               }
            }
         },
         "58252":{
            "invoiceno":"58252",
            "name":"invoice34",
            "companyId":"1001",
            "departmentId":"2",
            "lines":{
               "40002":{
                  "invoiceline":"40002",
                  "productid":"3",
                  "amount":"7000",
                  "quantity":"6"
               },
               "40003":{
                  "invoiceline":"40003",
                  "productid":"2",
                  "amount":"9000",
                  "quantity":"7"
                }
             }
          }
       }
    }
]"""
import pandas as pd
df = pd.read_json(jsonfile)
display(df)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The problem here is that some values are used as object keys, e.g. "58252", instead of appearing under a plain key such as "id":"58252". This is the case for both the "data" object and the "lines" object, so standard approaches such as pandas json_normalize or Spark explode don't seem to work with this particular JSON format.&lt;/P&gt;&lt;P&gt;The end goal is to get a dataframe looking like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="desired format in the dataframe after processing the json file"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1964i7A7436E09A6D14B5/image-size/large?v=v2&amp;amp;px=999" role="button" title="desired format in the dataframe after processing the json file" alt="desired format in the dataframe after processing the json file" /&gt;&lt;/span&gt;I hope you can give me some advice on how to approach and handle these files in Databricks.&lt;/P&gt;</description>
      <pubDate>Thu, 07 Apr 2022 08:11:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23486#M16211</guid>
      <dc:creator>steelman</dc:creator>
      <dc:date>2022-04-07T08:11:36Z</dc:date>
    </item>
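The keyed layout in the post above can also be illustrated outside Spark. The following plain-Python sketch (the helper name flatten_invoices is illustrative, not from the thread) uses the sample payload from the question and shows the target shape, carrying invoiceno onto every line:

```python
# Plain-Python sketch of the flattening the thread is after.
# Keys like "58251" are values promoted to object keys, so
# .values() recovers the records they label.
payload = [{
    "success": True,
    "numRows": 2,
    "data": {
        "58251": {"invoiceno": "58251", "name": "invoice1",
                  "companyId": "1000", "departmentId": "1",
                  "lines": {"40000": {"invoiceline": "40000", "productid": "1",
                                      "amount": "10000", "quantity": "7"},
                            "40001": {"invoiceline": "40001", "productid": "2",
                                      "amount": "9000", "quantity": "7"}}},
        "58252": {"invoiceno": "58252", "name": "invoice34",
                  "companyId": "1001", "departmentId": "2",
                  "lines": {"40002": {"invoiceline": "40002", "productid": "3",
                                      "amount": "7000", "quantity": "6"},
                            "40003": {"invoiceline": "40003", "productid": "2",
                                      "amount": "9000", "quantity": "7"}}},
    },
}]

def flatten_invoices(envelopes):
    """Yield one flat dict per invoice line, with the parent
    invoice fields (invoiceno, name, ...) copied onto each line."""
    rows = []
    for envelope in envelopes:
        for invoice in envelope["data"].values():
            # header = the invoice-level fields, minus the nested lines
            header = {k: v for k, v in invoice.items() if k != "lines"}
            for line in invoice["lines"].values():
                rows.append({**header, **line})
    return rows

rows = flatten_invoices(payload)
```

The same two-level .values() walk is what the Spark answers below emulate with flatMap and, later, with a MapType schema plus explode.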
    <item>
      <title>Re: how to flatten non-standard JSON files in a dataframe</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23487#M16212</link>
      <description>&lt;P&gt;I am including example code, as well as the notebook source and a screenshot.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;rddjson = sc.parallelize([jsonfile])
df = (spark.read
.json(rddjson)
.select("data.*"))
invoices = spark.createDataFrame(df.rdd.flatMap(lambda x: [y for y in x]))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image.png"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1959i2C77D3CE4838B414/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 07 Apr 2022 11:17:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23487#M16212</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-04-07T11:17:49Z</dc:date>
    </item>
    <item>
      <title>Re: how to flatten non-standard JSON files in a dataframe</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23488#M16213</link>
      <description>&lt;P&gt;So in the next step you can do something similar for the lines.&lt;/P&gt;</description>
      <pubDate>Thu, 07 Apr 2022 11:19:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23488#M16213</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-04-07T11:19:10Z</dc:date>
    </item>
    <item>
      <title>Re: how to flatten non-standard JSON files in a dataframe</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23489#M16214</link>
      <description>&lt;P&gt;@Hubert Dudek, thanks a lot for your answer. That helped. I still wonder how I can keep the data at the "data" level connected to the "lines" level.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# I would like to keep a reference to invoiceno in lines.*
dfl = invoices.select("lines.*", "invoiceno")

lines = spark.createDataFrame(dfl.rdd.flatMap(lambda x: [y for y in x]))

# returns the error message below
display(lines)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I enclose the notebook/DBC file with comments and questions as well.&lt;/P&gt;</description>
      <pubDate>Thu, 07 Apr 2022 18:18:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23489#M16214</guid>
      <dc:creator>steelman</dc:creator>
      <dc:date>2022-04-07T18:18:34Z</dc:date>
    </item>
    <item>
      <title>Re: how to flatten non-standard JSON files in a dataframe</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23490#M16215</link>
      <description>&lt;P&gt;@stale stokkereit&lt;/P&gt;&lt;P&gt;You can use the function below to flatten the struct fields:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pyspark.sql.functions as F

def flatten_df(nested_df):
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']

    flat_df = nested_df.select(flat_cols +
                               [F.col(nc+'.'+c).alias(nc+'_'+c)
                                for nc in nested_cols
                                for c in nested_df.select(nc+'.*').columns])
    return flat_df&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Your dataframe would look like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1975i88D8D252C863BADF/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 13 May 2022 16:37:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23490#M16215</guid>
      <dc:creator>Deepak_Bhutada</dc:creator>
      <dc:date>2022-05-13T16:37:50Z</dc:date>
    </item>
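For readers without a Spark session, the struct-flattening idea in the reply above can be sketched in plain Python. This recursive variant (flatten_record is an illustrative name, not from the thread) lifts nested fields to the top level with the same parent_child naming that flatten_df produces via its nc+'_'+c aliasing:

```python
def flatten_record(record, prefix=""):
    """Recursively lift nested dict fields to the top level,
    joining parent and child names with '_'."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the name prefix
            flat.update(flatten_record(value, name + "_"))
        else:
            flat[name] = value
    return flat

invoice = {
    "invoiceno": "58251",
    "lines": {"40000": {"invoiceline": "40000", "amount": "10000"}},
}
flat = flatten_record(invoice)
```

Note one difference: this sketch recurses through every nesting level in one pass, whereas flatten_df flattens a single struct level per call and would need to be applied repeatedly for deeper nesting.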
    <item>
      <title>Re: how to flatten non-standard JSON files in a dataframe</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23491#M16216</link>
      <description>&lt;P&gt;Hi @stale stokkereit,&lt;/P&gt;&lt;P&gt;Just a friendly follow-up: do you still need help, or did any of the responses help you? Please let us know.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 18:22:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23491#M16216</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-06-07T18:22:11Z</dc:date>
    </item>
    <item>
      <title>Re: how to flatten non-standard JSON files in a dataframe</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23492#M16217</link>
      <description>&lt;P&gt;Thanks for following up.&lt;/P&gt;&lt;P&gt;I didn't get an answer from this forum that I could use directly, but it did help me move forward, and after more study of the subject I figured out a solution to the problem.&lt;/P&gt;&lt;P&gt;The best solution was to write a schema before importing the JSON file. It took some time to write the schema and get it into the correct format, e.g.:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;StructField("data",
    MapType(StringType(), StructType([
        StructField("invoiceId", LongType(), True),
        ...&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;With the schema in place, Spark understood that the numbered keys on the node below "data" map to struct values, and handled them correctly.&lt;/P&gt;&lt;P&gt;Once the schema was correctly defined for the JSON file, it was possible to use Spark to do explode operations on the "data" node, e.g.:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import explode, col
df1 = dfresult.select(explode('data'))
df2 = df1.select("value.*")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;In the end I got all the data into a normalized table.&lt;/P&gt;&lt;P&gt;Thanks for all the contributions and efforts to help.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 20:49:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-flatten-non-standard-json-files-in-a-dataframe/m-p/23492#M16217</guid>
      <dc:creator>steelman</dc:creator>
      <dc:date>2022-06-07T20:49:42Z</dc:date>
    </item>
  </channel>
</rss>

