topic Re: Flattening JSON while also keep embedded types in Data Engineering

Flattening JSON while also keep embedded types

amanda3 — Thu, 07 Nov 2024 00:39:45 GMT

I'm attempting to create DLT tables from a source table that includes an "data" column that is a JSON string. I'm doing something like this:

sales_schema = StructType([
  StructField("customer_id", IntegerType(), True),
  StructField("order_numbers", ArrayType(LongType()), True)
  StructField("data", StructType([
      StructField("value", IntegerType())
  ])
])
@dlt.table(
  schema=sales_schema)
def sales():
  df = spark.readStream.table("table_name")
     .withColumn("parsed_data", from_json(col("data"), sales_schema)
     .select("parsed_data.*")
  return df

this works to flatten, but for some reason the data column in the DLT sales table becomes a struct where "value" is no longer an Integer but a String.

is there a way to keep the original value type?

Re: Flattening JSON while also keep embedded types

Walter_C — Thu, 07 Nov 2024 15:35:44 GMT

To ensure that the "value" field retains its integer type, you can explicitly cast it after parsing the JSON.

from pyspark.sql.functions import col, from_json, expr from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, LongType sales_schema = StructType([ StructField("customer_id", IntegerType(), True), StructField("order_numbers", ArrayType(LongType()), True), StructField("data", StructType([ StructField("value", IntegerType()) ])) ]) @dlt.table( schema=sales_schema ) def sales(): df = spark.readStream.table("table_name") \ .withColumn("parsed_data", from_json(col("data"), sales_schema)) \ .select("parsed_data.*") \ .withColumn("value", col("value").cast(IntegerType())) return df

Re: Flattening JSON while also keep embedded types

amanda3 — Thu, 07 Nov 2024 16:04:23 GMT

would this be the only way i can handle this? i have a handful of schemas and they all are pretty deeply nested. also, how would i handle the "ArrayType(LongType)" case

Re: Flattening JSON while also keep embedded types

amanda3 — Thu, 07 Nov 2024 21:36:47 GMT

actually it turns out that the original solution works, it just a bit confusing because when you look at the Catalog view everything is wrapped in quotations, but if you describe the table, it's correct