Data Engineering

How to get the count of DataFrame rows when reading through spark.readStream in batch jobs?

SRK
Contributor III

I am trying to read messages from a Kafka topic using spark.readStream. I am using the following code to read it.

My CODE:

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.1xx.1.1xx:9xx")
    .option("subscribe", "json_topic")
    .option("startingOffsets", "earliest")  # read from the beginning of the topic
    .load())

Now I just want to get the count of rows in df, the way the df.count() method works when we use spark.read.

I need to apply some conditions if I don't get any messages from the topic. I am running this code as a batch job, and it is a business requirement that I don't use spark.read.

Please suggest the best approach to get the count.

Thanks in advance!

2 REPLIES

daniel_sahal
Esteemed Contributor

You can try this approach:

https://stackoverflow.com/questions/57568038/how-to-see-the-dataframe-in-the-console-equivalent-of-s...

readStream runs a thread in the background, so there's no easy way to inspect the data like df.show().
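
One pattern that fits a batch-style job (a minimal sketch, not from the linked post; the topic name, broker address, and checkpoint path are placeholders) is to run the stream with foreachBatch and a run-once style trigger, and take the count inside the batch function, where each micro-batch is a plain DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-count").getOrCreate()

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.1xx.1.1xx:9xx")  # placeholder address from the question
    .option("subscribe", "json_topic")
    .option("startingOffsets", "earliest")
    .load())

def count_batch(batch_df, batch_id):
    # Inside foreachBatch the micro-batch is a static DataFrame, so count() works here.
    row_count = batch_df.count()
    print(f"batch {batch_id}: {row_count} rows")
    if row_count == 0:
        # place your "no messages received" conditions here
        pass

query = (df.writeStream
    .foreachBatch(count_batch)
    .trigger(availableNow=True)  # Spark 3.3+; on older versions use .trigger(once=True)
    .option("checkpointLocation", "/tmp/kafka_count_checkpoint")  # hypothetical path
    .start())

query.awaitTermination()

With availableNow=True (or the older once=True), the query processes whatever is currently on the topic and then stops, so the job still behaves like a batch run while the count is taken per micro-batch.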

Thanks for the suggestion. I will check.
