Migrating jsonb data from Postgresql database to Databricks

Karene — Tue, 02 Jan 2024 10:15:14 GMT

Hi Team,

I am trying to create a pipeline to incrementally ingest data from an RDS postgresql database which contains tables that have some columns of jsonb data type. I am currently using AWS DMS with CDC to first load the data into an S3 bucket as csv files, and then using Databricks Autoloader to ingest the files into a streaming delta table.

Currently, the json data is being stored as a string data type, whereas I would like it to be stored as a struct data type so that it can be queried.

What is the best way to achieve this with Autoloader so that the ingested data has the jsonb columns as struct data types? This is the code I am using to ingest the data -

spark.readStream.format("cloudFiles")

.option("cloudFiles.format", "csv")

.option("cloudFiles.inferSchema", "true")

.option("cloudFiles.inferColumnTypes", "true")

.load("s3://path/to/bucket")

Thanks in advance!

Re: Migrating jsonb data from Postgresql database to Databricks

BR_DatabricksAI — Tue, 02 Jan 2024 11:35:57 GMT

Hello Karene,

You can do the transformation in following manner from string to struct and refer to the example below:

data =[('001','{"name":"bhupendra","zipcode":"260100"}')]

schema = ['id','propertytype']

df = spark.createDataFrame(data,schema)

df.show(truncate=False)

df.printSchema()

from pyspark.sql.functions import from_json

from pyspark.sql.types import StructType, StructField,StringType

structTypeSchema = StructType([\

StructField('name',StringType()),\

StructField('zipcode',StringType())])

df1 = df.withColumn('propertystructtype', from_json(df.propertytype, structTypeSchema))

df1.show(truncate=False)

df1.printSchema()

topic Migrating jsonb data from Postgresql database to Databricks in Data Engineering

Migrating jsonb data from Postgresql database to Databricks

Re: Migrating jsonb data from Postgresql database to Databricks