Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

structured streaming schema inference

israelst
New Contributor III

I want to stream data from Kinesis using DLT. The data is in JSON format. How can I use Structured Streaming to infer the schema automatically? I know Auto Loader has this feature, but it doesn't make sense for me to use Auto Loader since my data is streaming from Kinesis...

3 REPLIES

Priyanka_Biswas
Databricks Employee

Hi @israelst 

When working with Kinesis in Databricks, you can effectively handle various data formats including JSON, Avro, or bytes. The key is to appropriately decode the data in your Spark application.

Before reading your stream, define your data schema. For JSON data, you can do this manually with PySpark's StructType and StructField, provided you know the structure of your data.

To read the stream, specify the source format as "kinesis" in your Databricks notebook.

# Wrap the chained calls in parentheses so they can span multiple lines.
df = (
    spark.readStream.format("kinesis")
    .option("streamName", "your_stream_name")
    .option("initialPosition", "latest")
    .load()
)

After defining the schema, use it to process the JSON data from Kinesis:

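from pyspark.sql.functions import col, from_json
# The Kinesis source delivers the record payload in the binary "data" column, so cast it to a string before parsing.
json_df = df.select(from_json(col("data").cast("string"), schema).alias("data")).select("data.*")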

Thanks @Priyanka_Biswas, but as I wrote, I am aiming for automatic inference of the schema. Auto Loader already has this functionality, but it seems the Kinesis Structured Streaming source doesn't...

israelst
New Contributor III

I wanted to use Databricks for this. I don't want to depend on AWS Glue. I'd like to do it the same way I could with Auto Loader...