Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Trigger.AvailableNow does not support maxOffsetsPerTrigger in Databricks runtime 10.3

SimonY
New Contributor III

Hello,

I ran a spark stream job to ingest data from kafka to test Trigger.AvailableNow.

What environment did the job run in?

1: Databricks Runtime 10.3

2: Azure cloud

3: 1 driver node + 3 worker nodes (14 GB, 4 cores each)

import org.apache.spark.sql.streaming.Trigger

val maxOffsetsPerTrigger = "500"

spark.conf.set("spark.databricks.delta.autoCompact.enabled", "auto")

...

val rdf = spark
  .readStream
  .format("kafka")
  .option("kafka.security.protocol", "SASL_PLAINTEXT")
  .option("kafka.sasl.mechanism", "SCRAM-SHA-512")
  .option("kafka.sasl.jaas.config", "<>")
  .option("kafka.bootstrap.servers", servers)
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
  .load()

rdf.writeStream
  .format("delta")
  .outputMode("append")
  .option("mergeSchema", "true")
  .option("checkpointLocation", ckpPath)
  .trigger(Trigger.AvailableNow)
  .start(tabPath)
  .awaitTermination()

What I expected to see:

1: The Spark streaming job reads all data from Kafka and then exits

2: The stream applies maxOffsetsPerTrigger to each micro-batch
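As a rough illustration of expectation 2: with maxOffsetsPerTrigger = 500, the backlog should be drained in many small micro-batches rather than a few huge ones. The record count below is hypothetical, since the post does not say how many records the topic holds:

```scala
// Hypothetical numbers: backlogRecords is an assumption for illustration.
val backlogRecords = 1000000L      // assumed total records in the topic
val maxOffsetsPerTrigger = 500L    // per-batch cap, as in the question

// Trigger.AvailableNow (where the Kafka source supports it) caps each
// micro-batch at maxOffsetsPerTrigger offsets, so draining the backlog
// should take many small batches:
val expectedBatches = math.ceil(backlogRecords.toDouble / maxOffsetsPerTrigger).toLong
println(expectedBatches) // 2000 small batches, not a handful of huge files
```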

What I see:

The Kafka topic has four partitions; the job ran for 5 hours and produced four huge data files (roughly 57.6 GiB in total), as if maxOffsetsPerTrigger were ignored:

File | Modified | Access tier | Blob type | Size | Status
part-00000-89afacf1-f2e6-4904-b313-080d48034859-c000.snappy.parquet | 3/25/2022, 9:50:48 PM | Hot (Inferred) | Block blob | 14.39 GiB | Available
part-00001-cf932ee2-8535-4dd6-9dab-e94b9292a438-c000.snappy.parquet | 3/25/2022, 6:15:36 PM | Hot (Inferred) | Block blob | 14.38 GiB | Available
part-00002-7d481793-10dc-4739-8c20-972cb6f18fd6-c000.snappy.parquet | 3/25/2022, 6:15:22 PM | Hot (Inferred) | Block blob | 14.41 GiB | Available
part-00003-17c88f26-f152-4b27-80cf-5ae372662950-c000.snappy.parquet | 3/25/2022, 9:48:14 PM | Hot (Inferred) | Block blob | 14.43 GiB | Available

1 ACCEPTED SOLUTION

SimonY
New Contributor III

@Karli Watsica, thanks for the help. This issue has been fixed in Databricks Runtime 10.4 and Spark 3.3.

[SPARK-36649] [SQL] Support Trigger.AvailableNow on Kafka data source

https://docs.databricks.com/release-notes/runtime/10.4.html
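For clusters that cannot upgrade yet, one possible workaround (a sketch, not something proposed in this thread) is to fall back to the default micro-batch trigger, which does honor maxOffsetsPerTrigger, and stop the query manually once it has caught up. `rdf`, `ckpPath`, and `tabPath` are the names from the question above:

```scala
// Hypothetical workaround for runtimes where the Kafka source does not
// yet support Trigger.AvailableNow: the default processing-time trigger
// honors maxOffsetsPerTrigger, so run normally and stop once caught up.
val query = rdf.writeStream
  .format("delta")
  .outputMode("append")
  .option("mergeSchema", "true")
  .option("checkpointLocation", ckpPath)
  .start(tabPath)

// Poll the query and stop it once a micro-batch processes zero new rows.
var caughtUp = false
while (query.isActive && !caughtUp) {
  Thread.sleep(10000) // poll every 10 s
  val progress = query.lastProgress // null until the first batch completes
  if (progress != null && progress.numInputRows == 0) {
    caughtUp = true
    query.stop()
  }
}
```

This relies on the standard StreamingQuery API (`isActive`, `lastProgress`, `stop`); the 10-second poll interval is an arbitrary choice.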


3 REPLIES

Eulaliasw
New Contributor II

We’re constantly working to improve our features based on feedback like this, so I’ll be sure to share your request with the API product team.



Anonymous
Not applicable

You'd be better off with one node with 12 cores than three nodes with 4 cores each; your shuffles will perform much better on a single machine.
