
structured streaming schema inference

israelst
New Contributor II

I want to stream data from Kinesis using DLT. The data is in JSON format. How can I use Structured Streaming to automatically infer the schema? I know Auto Loader has this feature, but it doesn't make sense for me to use Auto Loader since my data is streaming from Kinesis...

5 REPLIES

Priyanka_Biswas
Esteemed Contributor III

Hi @israelst 

When working with Kinesis in Databricks, you can effectively handle various data formats including JSON, Avro, or bytes. The key is to appropriately decode the data in your Spark application.

Before reading your stream, define your data schema. For JSON data, this can be done manually with PySpark's StructType and StructField if you know your data's structure.
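
For instance, a minimal sketch (the field names here are illustrative placeholders, not from the original question):

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical fields; replace with the actual structure of your JSON records.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("event_type", StringType(), True),
])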

To read the stream, specify the source format as "kinesis" in your Databricks notebook.

df = (spark.readStream
    .format("kinesis")
    .option("streamName", "your_stream_name")
    .option("initialPosition", "latest")
    .load())

After defining the schema, use it to process the JSON data from Kinesis:

from pyspark.sql.functions import col, from_json

# The Kinesis source exposes each record's payload in a binary "data" column,
# so cast it to a string before parsing it with the schema defined above.
json_df = df.select(from_json(col("data").cast("string"), schema).alias("data")).select("data.*")

israelst
New Contributor II

Thanks @Priyanka_Biswas, but as I wrote, I am aiming for automatic inference of the schema. Auto Loader already has this functionality, but it seems the Kinesis structured streaming source doesn't...

Kaniz_Fatma
Community Manager

Hi @israelst, you can use Structured Streaming in Databricks to read data from Kinesis and infer the schema.

 

Read from Kinesis: You can read from Kinesis using the readStream method. For example, in Python, you can use spark.readStream.format("kinesis").option("streamName", "<your-stream-name>").

 

Define the Schema: If you know the schema of your JSON data, you can define it with StructType and pass it to the schema method.

For example:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

jsonSchema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", TimestampType(), True),
    # Add more fields here
])

streamingInputDF = (
    spark
    .readStream
    .schema(jsonSchema)
    .json("/path/to/json/data")
)

 

Infer the Schema: If you don't know the schema of your JSON data, you can infer it by first reading a small batch of data, inferring the schema, and then using that schema for the rest of the stream. However, please note that if the schema changes after a streaming read begins against the table, the query fails.
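
One way to implement that sampling step is to run a short-lived streaming query into an in-memory table and let the batch JSON reader infer the schema from the collected strings. This is a sketch only; the stream name, query name, and wait time are illustrative:

import time
from pyspark.sql.functions import col

# Capture a small sample of raw JSON strings from the Kinesis stream.
sample_query = (spark.readStream.format("kinesis")
    .option("streamName", "your_stream_name")
    .option("initialPosition", "latest")
    .load()
    .select(col("data").cast("string").alias("json"))
    .writeStream
    .format("memory")
    .queryName("kinesis_sample")
    .start())

time.sleep(30)  # let a few records arrive; tune for your traffic
sample_query.stop()

# Infer a schema from the sampled strings, then reuse it with from_json
# in the real streaming query.
sample_json = [row.json for row in spark.table("kinesis_sample").collect()]
inferred_schema = spark.read.json(spark.sparkContext.parallelize(sample_json)).schema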

 

Process the Stream: Once you have the DataFrame, you can perform various operations on it, like filtering, transformations, and aggregations.
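
As a small illustration (the column name is a placeholder), filtering the stream from the schema step might look like:

from pyspark.sql.functions import col

# "field1" is a hypothetical column from the parsed JSON.
filtered_df = streamingInputDF.filter(col("field1").isNotNull())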

 

Write to Delta Lake: You can write the processed stream to a Delta Lake table using the writeStream method. For example, streamingInputDF.writeStream.format("delta").option("checkpointLocation", "/path/to/checkpoint").
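
Spelled out in runnable form (a sketch; the checkpoint location and output path are placeholders):

# Write the processed stream to Delta; all paths are placeholders.
(streamingInputDF.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/path/to/checkpoint")
    .start("/path/to/delta-table"))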

 


israelst
New Contributor II

I wanted to use Databricks for this. I don't want to depend on AWS Glue; I want to do it the same way I could with Auto Loader...
