How to flatten non-standard JSON files in a dataframe

steelman
New Contributor III

Hello,

I have a non-standard JSON file with a nested structure that I am having trouble with. Here is an example of the JSON file.

jsonfile= """[ {
      "success":true,
      "numRows":2,
      "data":{
         "58251":{
            "invoiceno":"58251",
            "name":"invoice1",
            "companyId":"1000",
            "departmentId":"1",
            "lines":{
               "40000":{
                  "invoiceline":"40000",
                  "productid":"1",
                  "amount":"10000",
                  "quantity":"7"
               },
               "40001":{
                  "invoiceline":"40001",
                  "productid":"2",
                  "amount":"9000",
                  "quantity":"7"
               }
            }
         },
         "58252":{
            "invoiceno":"58252",
            "name":"invoice34",
            "companyId":"1001",
            "departmentId":"2",
            "lines":{
               "40002":{
                  "invoiceline":"40002",
                  "productid":"3",
                  "amount":"7000",
                  "quantity":"6"
               },
               "40003":{
                  "invoiceline":"40003",
                  "productid":"2",
                  "amount":"9000",
                  "quantity":"7"
               }            }         }      }   }]"""
import pandas as pd
df = pd.read_json(jsonfile)
display(df)

The problem here is that identifiers such as "58252" are used as keys instead of values (for example "id":"58252", or a blank value). This is the case for both the "data" node and the "lines" node. So I have tried standard functions such as json_normalize (pandas) and explode (Spark), but they don't seem to work with this particular JSON format.
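To illustrate the issue, here is a minimal sketch (using the jsonfile string above) of what generic flattening does: the dynamic invoice and line numbers end up inside the column names instead of becoming rows.

import json
import pandas as pd

records = json.loads(jsonfile)

# json_normalize flattens the nested dicts, but every dynamic key becomes part of a
# column path such as data.58251.lines.40000.amount instead of producing one row per invoice
df_wide = pd.json_normalize(records)
print(df_wide.columns.tolist())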

The end goal is to get a dataframe looking like this:

[Image: desired format of the dataframe after processing the JSON file]

I hope you can give me some advice on how to approach and handle these files in Databricks.


6 REPLIES

Hubert-Dudek
Esteemed Contributor III

I am including example code, as well as the notebook source and a screenshot.

# parallelize the JSON string so Spark can read it as a single-element dataset
rddjson = sc.parallelize([jsonfile])

# read the JSON and promote the invoice-number keys under "data" to columns
df = (spark.read
      .json(rddjson)
      .select("data.*"))

# each column value is an invoice struct; flatten them into one row per invoice
invoices = spark.createDataFrame(df.rdd.flatMap(lambda x: [y for y in x]))

[Screenshot: resulting invoices dataframe]

Hubert-Dudek
Esteemed Contributor III

So in the next step you can do something similar for the lines.

steelman
New Contributor III

@Hubert Dudek,

Thanks a lot for your answer. That helped. I still wonder how I can keep the link between the "data" level and the "lines" level.

# I would like to keep the reference to invoiceno in lines.*
dfl = invoices.select("lines.*", "invoiceno")

lines = spark.createDataFrame(dfl.rdd.flatMap(lambda x: [y for y in x]))

# this returns the error message below
display(lines)

I enclose the notebook/DBC file with comments and questions as well.

Deepak_Bhutada
Contributor III

@stale stokkereit

You can use the function below to flatten struct columns:

import pyspark.sql.functions as F

def flatten_df(nested_df):
    # split the columns into plain columns and struct-typed columns
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']

    # keep the plain columns and promote every field of each struct column
    # to a top-level column named <struct>_<field>
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    return flat_df
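For example, applied to the invoices dataframe built earlier in the thread (a sketch; each call flattens one level of nesting, so it can be applied repeatedly for deeper structures):

# one call flattens one level: the struct column "lines" becomes lines_40000, lines_40001, ...
flat_invoices = flatten_df(invoices)
display(flat_invoices)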

Your dataframe would look like this:

[Image: flattened dataframe]

Hi @stale stokkereit,

Just a friendly follow-up: do you still need help, or did any of the responses help you? Please let us know.

Thanks for following up.

I didn't get any answers from this forum that I could use directly, but it did help me to move forward.

I eventually figured out a solution to the problem after more study of the subject.

The best solution was to write a schema before importing the JSON file. It took some time to write the schema and get it into the correct format, for example:

StructField("data",

        MapType(StringType(),StructType([

           StructField("invoiceId", LongType(),True),

With the schema in place, Spark understood that the numbered nodes below "data" are map entries containing structs, and they were handled correctly.
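For reference, a fuller version of such a schema for the sample file might look like the sketch below. The field names and types here are taken from the example JSON at the top of the thread rather than from the original notebook, and the read via sc.parallelize follows the earlier reply, so treat it as an illustration rather than the exact schema used.

from pyspark.sql.types import (StructType, StructField, MapType,
                               StringType, LongType, BooleanType)

# sketch of a complete schema: the invoice numbers under "data" and the line
# numbers under "lines" are modelled as map keys instead of struct field names
invoice_schema = StructType([
    StructField("success", BooleanType(), True),
    StructField("numRows", LongType(), True),
    StructField("data", MapType(StringType(), StructType([
        StructField("invoiceno", StringType(), True),
        StructField("name", StringType(), True),
        StructField("companyId", StringType(), True),
        StructField("departmentId", StringType(), True),
        StructField("lines", MapType(StringType(), StructType([
            StructField("invoiceline", StringType(), True),
            StructField("productid", StringType(), True),
            StructField("amount", StringType(), True),
            StructField("quantity", StringType(), True),
        ])), True),
    ])), True),
])

# read the JSON string with the explicit schema
dfresult = spark.read.schema(invoice_schema).json(sc.parallelize([jsonfile]))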

Once the schema was correctly defined for the JSON file, it was possible to use Spark to do explode operations on the "data" node. For example:

from pyspark.sql.functions import explode, col

# explode the "data" map into one row per invoice (columns: key, value)
df1 = dfresult.select(explode('data'))

# promote the invoice struct fields to top-level columns
df2 = df1.select("value.*")
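The lines map can then be exploded in the same way to reach the fully normalized, line-level table (a sketch continuing from df2 above; the column names follow the sample file, and the invoice-level columns are carried along so the link between invoices and lines is kept):

from pyspark.sql.functions import explode, col

# explode the "lines" map into one row per invoice line, keeping the invoice columns,
# then promote the line struct fields to top-level columns
df3 = (df2
       .select("invoiceno", "name", "companyId", "departmentId",
               explode(col("lines")).alias("lineno", "line"))
       .select("invoiceno", "name", "companyId", "departmentId", "line.*"))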

In the end I got all data into a normalized table.

Thanks for all the contributions and efforts to help.
