Flattening JSON while also keep embedded types

amanda3
New Contributor II

I'm attempting to create DLT tables from a source table that includes an "data" column that is a JSON string. I'm doing something like this: 

sales_schema = StructType([
  StructField("customer_id", IntegerType(), True),
  StructField("order_numbers", ArrayType(LongType()), True)
StructField("data", StructType([
StructField("value", IntegerType())
])
]
) @dlt.table( schema=sales_schema) def sales(): df = spark.readStream.table("table_name")
.withColumn("parsed_data", from_json(col("data"), sales_schema)
.select("parsed_data.*")
return df
this works to flatten, but for some reason the data column in the DLT sales table becomes a struct where "value" is no longer an Integer but a String. 
 
is there a way to keep the original value type?

 

Walter_C
Databricks Employee
Databricks Employee

To ensure that the "value" field retains its integer type, you can explicitly cast it after parsing the JSON.

from pyspark.sql.functions import col, from_json, expr
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, LongType

sales_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("order_numbers", ArrayType(LongType()), True),
    StructField("data", StructType([
        StructField("value", IntegerType())
    ]))
])

@dlt.table(
    schema=sales_schema
)
def sales():
    df = spark.readStream.table("table_name") \
        .withColumn("parsed_data", from_json(col("data"), sales_schema)) \
        .select("parsed_data.*") \
        .withColumn("value", col("value").cast(IntegerType()))
    return df

amanda3
New Contributor II

would this be the only way i can handle this? i have a handful of schemas and they all are pretty deeply nested. also, how would i handle the "ArrayType(LongType)" case

 

amanda3
New Contributor II

actually it turns out that the original solution works, it just a bit confusing because when you look at the Catalog view everything is wrapped in quotations, but if you describe the table, it's correct