Databricks Community

Nastia · ‎06-24-2024

Hi guys!

I am having an issue with passing a streaming flow between layers of a DLT pipeline.

The first layer, "ETD_Bz", runs fine, but the second layer, "ETD_Flattened_Bz", fails with the error "pyspark.errors.exceptions.captured.AnalysisException: Queries with streaming sources must be executed with writeStream.start();".

Code:

# works successfully
@dlt.table(
    name="ETD_Bz",
    temporary=False)
def Bronze():
    return (spark.readStream
            .format("delta")
            .option("skipChangeCommits", "true")
            .table("default.tbl_raw_etd_data")
    )

# function that flattens json
def process_raw_data(raw_tbl_name):
    df = (spark.readStream  # <- starts working once I am changing from readStream to read, but then it obviously stops processing incrementally
          .format("delta")
          .option("mergeSchema", "true")
          .table(raw_tbl_name)
    )

    json_schema = spark.read.json(df.rdd.map(lambda row: row.JsonString)).schema  # <- fails here

    kafka_df = df.withColumn("JsonStruct", from_json(col("JsonString"), json_schema))

    fj = FlattenJson()
    kafka_flattened_json = fj.flatten_json(kafka_df)

    return kafka_flattened_json

# failing layer
@dlt.table(
    name="ETD_Flattened_Bz",
    spark_conf={"spark.databricks.delta.schema.autoMerge.enabled": "true"},
    temporary=False)
def Bronze_Flattend():
    return process_raw_data("live.ETD_Bz")

Any help is appreciated! Thank you very much in advance.
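A note on the line marked "fails here": `df.rdd` asks Spark to execute the DataFrame as a batch job, which is not possible for a streaming source, hence the "must be executed with writeStream.start()" message. One way around it, sketched below, is to stop inferring the schema at run time and hand `from_json` a fixed schema instead; `from_json` accepts a DDL-formatted string as its schema argument. The helper `infer_ddl_schema` and the sample records are hypothetical and not part of the original post; the sketch assumes a handful of representative `JsonString` values have been collected once from a plain batch read.

```python
import json


def _shape(value):
    """Map a parsed JSON value to a nested description of Spark SQL types."""
    if isinstance(value, bool):  # bool must be checked before int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "STRING"
    if isinstance(value, dict):
        return {k: _shape(v) for k, v in value.items()}
    if isinstance(value, list):
        elem = None
        for item in value:
            elem = _merge(elem, _shape(item))
        return [elem]
    return None  # JSON null: type not known yet


def _merge(a, b):
    """Combine two shapes; conflicting scalar types fall back to STRING."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        return {k: _merge(a.get(k), b.get(k)) for k in {**a, **b}}
    if isinstance(a, list) and isinstance(b, list):
        return [_merge(a[0], b[0])]
    if a == b:
        return a
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"BIGINT", "DOUBLE"}:
        return "DOUBLE"
    return "STRING"


def _quote(name):
    """Backtick-quote a field name for use in a DDL string."""
    return "`" + name.replace("`", "``") + "`"


def _ddl(shape):
    """Render a shape as a Spark SQL type string."""
    if shape is None:
        return "STRING"  # only nulls were seen; STRING is an assumption
    if isinstance(shape, dict):
        fields = ", ".join(f"{_quote(k)}: {_ddl(v)}" for k, v in shape.items())
        return f"STRUCT<{fields}>"
    if isinstance(shape, list):
        return f"ARRAY<{_ddl(shape[0])}>"
    return shape


def infer_ddl_schema(samples):
    """Build a DDL schema string from sample JSON object strings."""
    merged = None
    for sample in samples:
        merged = _merge(merged, _shape(json.loads(sample)))
    if not isinstance(merged, dict):
        raise ValueError("expected JSON objects at the top level")
    return ", ".join(f"{_quote(k)} {_ddl(v)}" for k, v in merged.items())


# Made-up sample records standing in for real JsonString values.
samples = [
    '{"id": 1, "payload": {"speed": 12.5}}',
    '{"id": 2, "payload": {"tags": ["a"]}, "ok": true}',
]
JSON_SCHEMA_DDL = infer_ddl_schema(samples)
print(JSON_SCHEMA_DDL)

# Hypothetical use inside the pipeline (not executed here):
#   df.withColumn("JsonStruct", from_json(col("JsonString"), JSON_SCHEMA_DDL))
```

The trade-off is that the schema is frozen: fields that never appear in the samples are silently dropped by `from_json`, so the string would need to be regenerated when the upstream JSON changes.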

Nastia · ‎06-24-2024

UPDATE: I tried adding writeStream.start(), as the error suggested and as per other posts, and ended up with the following code and error:

@dlt.table(
    name="ETD_Bz",
    temporary=False)
def Bronze():
    return (spark.readStream
            .format("delta")
            .option("skipChangeCommits", "true")
            .table("default.tbl_raw_etd_data")
    )

# function that flattens json
def process_raw_data(df, batchId):
    json_schema = spark.read.json(df.rdd.map(lambda row: row.JsonString)).schema

    kafka_df = df.withColumn("JsonStruct", from_json(col("JsonString"), json_schema))

    fj = FlattenJson()
    kafka_flattened_json = fj.flatten_json(kafka_df)

    return kafka_flattened_json

@dlt.table(
    name="ETD_Flattened_Bz",
    spark_conf={"spark.databricks.delta.schema.autoMerge.enabled": "true"},
    temporary=False)
def Bronze_Flattend():
    stream = (spark.readStream
              .format("delta")
              .option("skipChangeCommits", "true")
              .table("live.ETD_Bz")
              .writeStream
              .format("json")
              .outputMode("append")
              .foreachBatch(process_raw_data)
              .table("default.tbl_bz_tmp_etd_data")
              .start()
              .awaitTermination())

    return (spark.readStream
            .format("delta")
            .option("skipChangeCommits", "true")
            .table("default.tbl_bz_tmp_etd_data")
    )

I am getting the following error:

"py4j.protocol.Py4JJavaError: An error occurred while calling o687.toTable. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find data source: foreachBatch. Please find packages at `https://spark.apache.org/third-party-projects.html`."
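A note on this second error: it appears to come from chaining `foreachBatch` with a table sink, so Spark ends up looking for a data source literally named "foreachBatch". Separately, the value returned by a `foreachBatch` callback is not written anywhere, and a `@dlt.table` function is expected to return a DataFrame rather than start and await its own streaming query. If the JSON layout is known up front, the flattening can instead stay a plain column projection on the streaming DataFrame. The sketch below only computes the (column path, alias) pairs for such a projection; `leaf_columns` and the example layout are hypothetical stand-ins, not the `FlattenJson` class from the post, and arrays or field names containing dots are not handled.

```python
def leaf_columns(layout, root="JsonStruct", _path=()):
    """Yield (dotted_column_path, flat_alias) for every leaf of a nested layout.

    `layout` is a dict whose values are either a type name (leaf) or
    another dict (nested struct). `root` is the struct column that holds
    the parsed JSON.
    """
    for name, child in layout.items():
        path = _path + (name,)
        if isinstance(child, dict):
            # Nested struct: descend and keep extending the path.
            yield from leaf_columns(child, root, path)
        else:
            # Leaf: "JsonStruct.a.b" is selected and aliased as "a_b".
            yield ".".join((root,) + path), "_".join(path)


# Made-up layout used only for illustration.
layout = {
    "id": "BIGINT",
    "payload": {
        "speed": "DOUBLE",
        "gps": {"lat": "DOUBLE", "lon": "DOUBLE"},
    },
}

pairs = list(leaf_columns(layout))
for column_path, alias in pairs:
    print(f"{column_path} -> {alias}")

# Hypothetical use on a streaming DataFrame (not executed here):
#   flat_df = kafka_df.select(*[col(p).alias(a) for p, a in pairs])
```

Because this is just a `select`, it needs no `writeStream`, no intermediate table, and no per-batch schema inference.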


DLT fails with Queries with streaming sources must be executed with writeStream.start();
