<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Migrating jsonb data from Postgresql database to Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/migrating-jsonb-data-from-postgresql-database-to-databricks/m-p/56066#M30498</link>
    <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;I am building a pipeline to incrementally ingest data from an RDS PostgreSQL database whose tables include some columns of the jsonb data type. I currently use AWS DMS with CDC to load the data into an S3 bucket as CSV files, and then Databricks Auto Loader to ingest the files into a streaming Delta table.&lt;/P&gt;&lt;P&gt;Currently, the JSON data is stored as a string, whereas I would like it stored as a struct so that its fields can be queried.&lt;/P&gt;&lt;P&gt;What is the best way to achieve this with Auto Loader, so that the jsonb columns are ingested as struct data types? This is the code I am using to ingest the data:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;spark.readStream.&lt;/SPAN&gt;&lt;SPAN&gt;format&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"cloudFiles"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;option&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"cloudFiles.format"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"csv"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;option&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"cloudFiles.inferSchema"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"true"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;option&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"cloudFiles.inferColumnTypes"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"true"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;load&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"s3://path/to/bucket"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Thanks in advance!&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Tue, 02 Jan 2024 10:15:14 GMT</pubDate>
    <dc:creator>Karene</dc:creator>
    <dc:date>2024-01-02T10:15:14Z</dc:date>
    <item>
      <title>Migrating jsonb data from Postgresql database to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/migrating-jsonb-data-from-postgresql-database-to-databricks/m-p/56066#M30498</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;I am building a pipeline to incrementally ingest data from an RDS PostgreSQL database whose tables include some columns of the jsonb data type. I currently use AWS DMS with CDC to load the data into an S3 bucket as CSV files, and then Databricks Auto Loader to ingest the files into a streaming Delta table.&lt;/P&gt;&lt;P&gt;Currently, the JSON data is stored as a string, whereas I would like it stored as a struct so that its fields can be queried.&lt;/P&gt;&lt;P&gt;What is the best way to achieve this with Auto Loader, so that the jsonb columns are ingested as struct data types? This is the code I am using to ingest the data:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;spark.readStream.&lt;/SPAN&gt;&lt;SPAN&gt;format&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"cloudFiles"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;option&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"cloudFiles.format"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"csv"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;option&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"cloudFiles.inferSchema"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"true"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;option&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"cloudFiles.inferColumnTypes"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"true"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;load&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"s3://path/to/bucket"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Thanks in advance!&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 02 Jan 2024 10:15:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/migrating-jsonb-data-from-postgresql-database-to-databricks/m-p/56066#M30498</guid>
      <dc:creator>Karene</dc:creator>
      <dc:date>2024-01-02T10:15:14Z</dc:date>
    </item>
    <item>
      <title>Re: Migrating jsonb data from Postgresql database to Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/migrating-jsonb-data-from-postgresql-database-to-databricks/m-p/56079#M30504</link>
      <description>&lt;P&gt;Hello Karene,&lt;/P&gt;&lt;P&gt;You can convert the JSON string column to a struct with from_json by supplying the expected schema, as in the example below:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;data = [(&lt;/SPAN&gt;&lt;SPAN&gt;'001'&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;'{"name":"bhupendra","zipcode":"260100"}'&lt;/SPAN&gt;&lt;SPAN&gt;)]&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;schema = [&lt;/SPAN&gt;&lt;SPAN&gt;'id'&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;'propertytype'&lt;/SPAN&gt;&lt;SPAN&gt;]&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;df = spark.createDataFrame(data, schema)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df.show(truncate=&lt;/SPAN&gt;&lt;SPAN&gt;False&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df.printSchema()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;from&lt;/SPAN&gt;&lt;SPAN&gt; pyspark.sql.functions &lt;/SPAN&gt;&lt;SPAN&gt;import&lt;/SPAN&gt;&lt;SPAN&gt; from_json&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;from&lt;/SPAN&gt;&lt;SPAN&gt; pyspark.sql.types &lt;/SPAN&gt;&lt;SPAN&gt;import&lt;/SPAN&gt;&lt;SPAN&gt; StructType, StructField, StringType&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;structTypeSchema = StructType([&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; StructField(&lt;/SPAN&gt;&lt;SPAN&gt;'name'&lt;/SPAN&gt;&lt;SPAN&gt;, StringType()),&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; StructField(&lt;/SPAN&gt;&lt;SPAN&gt;'zipcode'&lt;/SPAN&gt;&lt;SPAN&gt;, StringType())])&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;df1 = df.withColumn(&lt;/SPAN&gt;&lt;SPAN&gt;'propertystructtype'&lt;/SPAN&gt;&lt;SPAN&gt;, from_json(df.propertytype, structTypeSchema))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df1.show(truncate=&lt;/SPAN&gt;&lt;SPAN&gt;False&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df1.printSchema()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 02 Jan 2024 11:35:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/migrating-jsonb-data-from-postgresql-database-to-databricks/m-p/56079#M30504</guid>
      <dc:creator>BR_DatabricksAI</dc:creator>
      <dc:date>2024-01-02T11:35:57Z</dc:date>
    </item>
  </channel>
</rss>

