How to get the count of dataframe rows when reading through spark.readstream using batch jobs?
12-21-2022 08:29 AM
I am trying to read messages from a Kafka topic using spark.readStream, and I am using the following code to read it.
My CODE:
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.1xx.1.1xx:9xx")
    .option("subscribe", "json_topic")
    .option("startingOffsets", "earliest")  # read from the beginning of the topic
    .load())
Now I just want to get the count of df, the way the df.count() method gives it when we use spark.read.
I need to apply some conditions if I don't get any messages from the topic. I am running this code as a batch job, and it's a business requirement that I not use spark.read.
Please suggest the best approach to get the count.
Thanks in advance!
12-22-2022 05:13 AM
You can try this approach: readStream runs the query on a background thread, so a streaming DataFrame has no direct equivalent of df.count() or df.show(). Instead, run the stream as a one-shot batch and count the rows inside foreachBatch, where each micro-batch arrives as a regular DataFrame.
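A minimal sketch of that idea, assuming Spark 3.3+ (for trigger(availableNow=True); on older versions, trigger(once=True) behaves similarly) and reusing the masked broker address from the question. The checkpoint path and query name are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka_batch_count").getOrCreate()

batch_counts = []  # foreachBatch runs on the driver, so a plain Python list works

def count_batch(batch_df, batch_id):
    # Inside foreachBatch, batch_df is an ordinary (non-streaming) DataFrame,
    # so count() works just like it does with spark.read.
    batch_counts.append(batch_df.count())

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.1xx.1.1xx:9xx")  # masked address from the question
    .option("subscribe", "json_topic")
    .option("startingOffsets", "earliest")
    .load())

query = (df.writeStream
    .foreachBatch(count_batch)
    .trigger(availableNow=True)  # drain everything currently in the topic, then stop
    .option("checkpointLocation", "/tmp/kafka_count_checkpoint")  # hypothetical path
    .start())
query.awaitTermination()

total = sum(batch_counts)
if total == 0:
    print("No messages received from the topic")
else:
    print(f"Read {total} rows from the topic")

If you only need rough progress numbers rather than an exact gate, query.lastProgress also reports "numInputRows" per micro-batch, but the foreachBatch count above is the more reliable check for the "no messages" condition.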
12-22-2022 10:07 PM
Thanks for the suggestion. I will check.