how to flatten non standard Json files in a datafr...

steelman · ‎04-07-2022

hello,

I have a non standard Json file with a nested file structure that I have issues with. Here is an example of the json file.

jsonfile= """[ {
      "success":true,
      "numRows":2,
      "data":{
         "58251":{
            "invoiceno":"58251",
            "name":"invoice1",
            "companyId":"1000",
            "departmentId":"1",
            "lines":{
               "40000":{
                  "invoiceline":"40000",
                  "productid":"1",
                  "amount":"10000",
                  "quantity":"7"
               },
               "40001":{
                  "invoiceline":"40001",
                  "productid":"2",
                  "amount":"9000",
                  "quantity":"7"
               }
            }
         },
         "58252":{
            "invoiceno":"58252",
            "name":"invoice34",
            "companyId":"1001",
            "departmentId":"2",
            "lines":{
               "40002":{
                  "invoiceline":"40002",
                  "productid":"3",
                  "amount":"7000",
                  "quantity":"6"
               },
               "40003":{
                  "invoiceline":"40003",
                  "productid":"2",
                  "amount":"9000",
                  "quantity":"7"
               }            }         }      }   }]"""
import pandas as pd
df = pd.read_json(jsonfile)
display(df)

The problem here is that some of the keys are used as "58252", instead of "id":"58252" or blank value. This is the case for both the "Data" array and the "lines" array. So I have tried using standard functions in spark with json_normalize or explode but it doesnt seem to work with this particular json format.

The endgoal is to get a dataframe looking like this:

I hope you can give me some advice on how to approach and handle these files in databricks.

how to flatten non standard Json files in a dataframe