<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Flattening JSON while also keep embedded types in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/flattening-json-while-also-keep-embedded-types/m-p/98141#M39619</link>
    <description>&lt;P&gt;actually it turns out that the original solution works, it just a bit confusing because when you look at the Catalog view everything is wrapped in quotations, but if you describe the table, it's correct&lt;/P&gt;</description>
    <pubDate>Thu, 07 Nov 2024 21:36:47 GMT</pubDate>
    <dc:creator>amanda3</dc:creator>
    <dc:date>2024-11-07T21:36:47Z</dc:date>
    <item>
      <title>Flattening JSON while also keep embedded types</title>
      <link>https://community.databricks.com/t5/data-engineering/flattening-json-while-also-keep-embedded-types/m-p/98026#M39588</link>
      <description>&lt;P&gt;I'm attempting to create DLT tables from a source table that includes an "data" column that is a JSON string. I'm doing something like this:&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;sales_schema&lt;/SPAN&gt; &lt;SPAN class=""&gt;=&lt;/SPAN&gt; &lt;SPAN class=""&gt;StructType&lt;/SPAN&gt;&lt;SPAN class=""&gt;([&lt;/SPAN&gt;
  &lt;SPAN class=""&gt;StructField&lt;/SPAN&gt;&lt;SPAN class=""&gt;(&lt;/SPAN&gt;&lt;SPAN class=""&gt;"customer_id"&lt;/SPAN&gt;&lt;SPAN class=""&gt;,&lt;/SPAN&gt; &lt;SPAN class=""&gt;IntegerType&lt;/SPAN&gt;&lt;SPAN class=""&gt;(),&lt;/SPAN&gt; &lt;SPAN class=""&gt;True&lt;/SPAN&gt;&lt;SPAN class=""&gt;),&lt;/SPAN&gt;
  &lt;SPAN class=""&gt;StructField&lt;/SPAN&gt;&lt;SPAN class=""&gt;(&lt;/SPAN&gt;&lt;SPAN class=""&gt;"order_numbers"&lt;/SPAN&gt;&lt;SPAN class=""&gt;,&lt;/SPAN&gt; ArrayType(&lt;SPAN class=""&gt;LongType&lt;/SPAN&gt;&lt;SPAN class=""&gt;()),&lt;/SPAN&gt; &lt;SPAN class=""&gt;True&lt;/SPAN&gt;&lt;SPAN class=""&gt;)&lt;BR /&gt;  StructField("data", StructType([&lt;BR /&gt;      StructField("value", IntegerType())&lt;BR /&gt;  ])&lt;BR /&gt;]&lt;/SPAN&gt;&lt;SPAN class=""&gt;)&lt;/SPAN&gt;
&lt;SPAN class=""&gt;@dlt&lt;/SPAN&gt;&lt;SPAN class=""&gt;.&lt;/SPAN&gt;&lt;SPAN class=""&gt;table&lt;/SPAN&gt;&lt;SPAN class=""&gt;(&lt;/SPAN&gt;
  &lt;SPAN class=""&gt;schema&lt;/SPAN&gt;&lt;SPAN class=""&gt;=&lt;/SPAN&gt;&lt;SPAN class=""&gt;sales_schema&lt;/SPAN&gt;&lt;SPAN class=""&gt;)&lt;/SPAN&gt;
&lt;SPAN class=""&gt;def&lt;/SPAN&gt; &lt;SPAN class=""&gt;sales&lt;/SPAN&gt;&lt;SPAN class=""&gt;():&lt;/SPAN&gt;
  df = spark.readStream.table("table_name")&lt;BR /&gt;     .withColumn("parsed_data", from_json(col("data"), sales_schema)&lt;BR /&gt;     .select("parsed_data.*")&lt;BR /&gt;  return df&lt;/PRE&gt;&lt;DIV&gt;this works to flatten, but for some reason the data column in the DLT sales table becomes a struct where "value" is no longer an Integer but a String.&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;is there a way to keep the original value type?&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 07 Nov 2024 00:39:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/flattening-json-while-also-keep-embedded-types/m-p/98026#M39588</guid>
      <dc:creator>amanda3</dc:creator>
      <dc:date>2024-11-07T00:39:45Z</dc:date>
    </item>
    <item>
      <title>Re: Flattening JSON while also keep embedded types</title>
      <link>https://community.databricks.com/t5/data-engineering/flattening-json-while-also-keep-embedded-types/m-p/98098#M39605</link>
      <description>&lt;P&gt;To ensure that the "value" field retains its integer type, you can explicitly cast it after parsing the JSON.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;from pyspark.sql.functions import col, from_json, expr
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, LongType

sales_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("order_numbers", ArrayType(LongType()), True),
    StructField("data", StructType([
        StructField("value", IntegerType())
    ]))
])

@dlt.table(
    schema=sales_schema
)
def sales():
    df = spark.readStream.table("table_name") \
        .withColumn("parsed_data", from_json(col("data"), sales_schema)) \
        .select("parsed_data.*") \
        .withColumn("value", col("value").cast(IntegerType()))
    return df&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 07 Nov 2024 15:35:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/flattening-json-while-also-keep-embedded-types/m-p/98098#M39605</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2024-11-07T15:35:44Z</dc:date>
    </item>
    <item>
      <title>Re: Flattening JSON while also keep embedded types</title>
      <link>https://community.databricks.com/t5/data-engineering/flattening-json-while-also-keep-embedded-types/m-p/98108#M39610</link>
      <description>&lt;P&gt;would this be the only way i can handle this? i have a handful of schemas and they all are pretty deeply nested. also, how would i handle the "ArrayType(LongType)" case&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 07 Nov 2024 16:04:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/flattening-json-while-also-keep-embedded-types/m-p/98108#M39610</guid>
      <dc:creator>amanda3</dc:creator>
      <dc:date>2024-11-07T16:04:23Z</dc:date>
    </item>
    <item>
      <title>Re: Flattening JSON while also keep embedded types</title>
      <link>https://community.databricks.com/t5/data-engineering/flattening-json-while-also-keep-embedded-types/m-p/98141#M39619</link>
      <description>&lt;P&gt;actually it turns out that the original solution works, it just a bit confusing because when you look at the Catalog view everything is wrapped in quotations, but if you describe the table, it's correct&lt;/P&gt;</description>
      <pubDate>Thu, 07 Nov 2024 21:36:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/flattening-json-while-also-keep-embedded-types/m-p/98141#M39619</guid>
      <dc:creator>amanda3</dc:creator>
      <dc:date>2024-11-07T21:36:47Z</dc:date>
    </item>
  </channel>
</rss>

