Flattening JSON while also keep embedded types
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-06-2024 04:39 PM
I'm attempting to create DLT tables from a source table that includes an "data" column that is a JSON string. I'm doing something like this:
sales_schema = StructType([ StructField("customer_id", IntegerType(), True), StructField("order_numbers", ArrayType(LongType()), True)
StructField("data", StructType([
StructField("value", IntegerType())
])
]) @dlt.table( schema=sales_schema) def sales(): df = spark.readStream.table("table_name")
.withColumn("parsed_data", from_json(col("data"), sales_schema)
.select("parsed_data.*")
return df
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-07-2024 07:35 AM
To ensure that the "value" field retains its integer type, you can explicitly cast it after parsing the JSON.
from pyspark.sql.functions import col, from_json, expr
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, LongType
sales_schema = StructType([
StructField("customer_id", IntegerType(), True),
StructField("order_numbers", ArrayType(LongType()), True),
StructField("data", StructType([
StructField("value", IntegerType())
]))
])
@dlt.table(
schema=sales_schema
)
def sales():
df = spark.readStream.table("table_name") \
.withColumn("parsed_data", from_json(col("data"), sales_schema)) \
.select("parsed_data.*") \
.withColumn("value", col("value").cast(IntegerType()))
return df
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-07-2024 08:04 AM
would this be the only way i can handle this? i have a handful of schemas and they all are pretty deeply nested. also, how would i handle the "ArrayType(LongType)" case
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-07-2024 01:36 PM
actually it turns out that the original solution works, it just a bit confusing because when you look at the Catalog view everything is wrapped in quotations, but if you describe the table, it's correct