How to get the count of dataframe rows when reading through spark.readstream using batch jobs?
12-21-2022 08:29 AM
I am trying to read messages from a Kafka topic using spark.readStream, and I am using the following code to read it.
My CODE:
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.1xx.1.1xx:9xx")
    .option("subscribe", "json_topic")
    .option("startingOffsets", "earliest")  # read from the beginning of the topic
    .load())
Now I just want to get the count of df, the way the df.count() method gives it when we use spark.read.
I need to apply some conditions if I don't get any messages from the topic. I am running this code as a batch job, and it's a business requirement that I not use spark.read.
Please suggest the best approach to get the count.
Thanks in advance!
12-22-2022 05:13 AM
You can try this approach: readStream runs the query on a background thread, so a streaming DataFrame has no direct equivalent of df.count() or df.show(). Instead, run the stream as a one-shot batch and count the rows inside foreachBatch, where each micro-batch arrives as a regular DataFrame.
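A minimal sketch of that idea, assuming Spark 3.3+ (for trigger(availableNow=True); on older versions, trigger(once=True) behaves similarly) and reusing the masked broker address from the question. The checkpoint path and query name are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka_batch_count").getOrCreate()

batch_counts = []  # foreachBatch runs on the driver, so a plain Python list works

def count_batch(batch_df, batch_id):
    # Inside foreachBatch, batch_df is an ordinary (non-streaming) DataFrame,
    # so count() works just like it does with spark.read.
    batch_counts.append(batch_df.count())

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.1xx.1.1xx:9xx")  # masked address from the question
    .option("subscribe", "json_topic")
    .option("startingOffsets", "earliest")
    .load())

query = (df.writeStream
    .foreachBatch(count_batch)
    .trigger(availableNow=True)  # drain everything currently in the topic, then stop
    .option("checkpointLocation", "/tmp/kafka_count_checkpoint")  # hypothetical path
    .start())
query.awaitTermination()

total = sum(batch_counts)
if total == 0:
    print("No messages received from the topic")
else:
    print(f"Read {total} rows from the topic")

If you only need rough progress numbers rather than an exact gate, query.lastProgress also reports "numInputRows" per micro-batch, but the foreachBatch count above is the more reliable check for the "no messages" condition.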
12-22-2022 10:07 PM
Thanks for the suggestion. I will check.