Databricks Community

Karene · 01-02-2024

Hi Team,

I am trying to create a pipeline to incrementally ingest data from an RDS PostgreSQL database. Some of its tables have columns of the jsonb data type. I am currently using AWS DMS with CDC to load the data into an S3 bucket as CSV files, and then using Databricks Auto Loader to ingest the files into a streaming Delta table.

Currently, the JSON data is stored as a string, whereas I would like it to be stored as a struct so that its fields can be queried.

What is the best way to achieve this with Auto Loader, so that the jsonb columns are ingested as structs? This is the code I am using to ingest the data:

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.inferSchema", "true")
    .option("cloudFiles.inferColumnTypes", "true")
    .load("s3://path/to/bucket")
)

Thanks in advance!

BR_DatabricksAI · 01-02-2024

Hello Karene,

You can convert the column from string to struct with from_json by passing it a schema that describes the JSON fields. Refer to the example below:

from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

# Sample row: an id plus a JSON document stored as a plain string
data = [('001', '{"name":"bhupendra","zipcode":"260100"}')]
schema = ['id', 'propertytype']
df = spark.createDataFrame(data, schema)
df.show(truncate=False)
df.printSchema()

# Schema describing the fields inside the JSON string
structTypeSchema = StructType([
    StructField('name', StringType()),
    StructField('zipcode', StringType()),
])

# Parse the string column into a new struct column
df1 = df.withColumn('propertystructtype', from_json(df.propertytype, structTypeSchema))
df1.show(truncate=False)
df1.printSchema()
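The same conversion can be applied to the Auto Loader stream from the question, since from_json works on streaming DataFrames as well. Below is a minimal sketch, not a definitive implementation: the helper name parse_jsonb_columns and the JSONB_SCHEMAS mapping are hypothetical, the column name and fields are reused from the sample above, and the schema is written as a DDL string, which from_json accepts in place of a StructType.

```python
# Hypothetical mapping: jsonb column name -> DDL schema string for from_json.
# Replace the column names and fields with those of your own tables.
JSONB_SCHEMAS = {
    "propertytype": "name STRING, zipcode STRING",
}


def parse_jsonb_columns(df, schemas=None):
    """Return df with each listed JSON string column replaced by a struct."""
    # Imported inside the function so the helper can be defined
    # even where PySpark is not installed.
    from pyspark.sql.functions import col, from_json

    for name, ddl in (schemas or JSONB_SCHEMAS).items():
        # Overwrite the string column with its parsed struct version.
        df = df.withColumn(name, from_json(col(name), ddl))
    return df


# Usage on Databricks (not executed here); `raw` stands for the streaming
# DataFrame returned by the Auto Loader read in the question:
# parsed = parse_jsonb_columns(raw)
# parsed.select("propertytype.name")
```

Because the result is still a streaming DataFrame, it can be written to the Delta table as before, and the struct fields can then be queried with dot notation.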

Migrating jsonb data from Postgresql database to Databricks
