Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Databricks Pub-Sub Data Recon

Ajay-Pandey
Esteemed Contributor III

I am trying to set up a recon activity between GCP Pub/Sub and Databricks. Is there any way to fetch the record count for the last 24 hours from Pub/Sub?

I tried but could not find a direct solution. It would be great if anyone could suggest a way to achieve it.

#pubsub #databricks

Ajay Kumar Pandey
3 REPLIES

Prabakar
Databricks Employee

 

To fetch the record count for the last 24 hours from Pub/Sub, filter on the publishTimestampInMillis field in the Pub/Sub schema, which holds each record's publish time in epoch milliseconds. Use the current_timestamp() function in Databricks to get the current timestamp and subtract 24 hours from it to get the cutoff. Then use the filter() function to keep only the records whose publishTimestampInMillis is at or after that cutoff.

Here's an example code snippet that demonstrates how to fetch the last 24 hours' record count from Pub/Sub using Databricks:

 

 

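import org.apache.spark.sql.functions._

// Service-account credentials for the Pub/Sub connector
val authOptions: Map[String, String] =
 Map("clientId" -> clientId,
 "clientEmail" -> clientEmail,
 "privateKey" -> privateKey,
 "privateKeyId" -> privateKeyId)

val pubsubDF = spark.readStream
 .format("pubsub")
 .option("subscriptionId", "mysub")
 .option("topicId", "mytopic")
 .option("projectId", "myproject")
 .options(authOptions)
 .load()

// Timestamp for 24 hours ago
val last24HoursTimestamp = current_timestamp() - expr("INTERVAL 24 HOURS")

// cast("long") yields epoch seconds, so multiply by 1000 to compare with publishTimestampInMillis
val last24HoursCount = pubsubDF
  .filter(col("publishTimestampInMillis") >= last24HoursTimestamp.cast("long") * 1000)
  .groupBy()
  .count()

// A streaming DataFrame does not support the count() action directly,
// so run the aggregation as a streaming query and print the running count
val query = last24HoursCount.writeStream
  .outputMode("complete")
  .format("console")
  .start()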

 

Note that this code snippet assumes that you have already configured the Pub/Sub connector in Databricks and have the necessary authorization options. If you haven't done so, please refer to the "Subscribe to Google Pub/Sub" page in the Databricks on Google Cloud documentation for more information.
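
The cutoff arithmetic described above can be checked without a cluster. The following is only a minimal sketch in plain Scala, assuming publish timestamps are epoch milliseconds as the field name publishTimestampInMillis suggests. The object Last24HoursCutoff and the helpers cutoffMillis and countSince are hypothetical names introduced for illustration, and the sample timestamps are made up.

```scala
import java.time.{Duration, Instant}

object Last24HoursCutoff {
  // Hypothetical helper: epoch-millisecond cutoff for a trailing window ending at `now`.
  def cutoffMillis(now: Instant, window: Duration = Duration.ofHours(24)): Long =
    now.minus(window).toEpochMilli

  // Hypothetical helper: count publish timestamps (epoch millis) at or after the cutoff.
  def countSince(publishTimestampsInMillis: Seq[Long], cutoff: Long): Int =
    publishTimestampsInMillis.count(_ >= cutoff)

  def main(args: Array[String]): Unit = {
    // Fixed "now" so the example is reproducible; real code would use Instant.now().
    val now = Instant.parse("2024-01-02T00:00:00Z")
    val cutoff = cutoffMillis(now)

    // Made-up sample publish times, not real Pub/Sub data.
    val sample = Seq(
      Instant.parse("2024-01-01T23:00:00Z").toEpochMilli, // inside the window
      Instant.parse("2024-01-01T00:00:00Z").toEpochMilli, // exactly on the cutoff, kept by >=
      Instant.parse("2023-12-31T12:00:00Z").toEpochMilli  // older than 24 hours
    )

    println(s"Cutoff (epoch millis): $cutoff")
    println(s"Records in window: ${countSince(sample, cutoff)}")
  }
}
```

Comparing in milliseconds on both sides matters here: a cutoff expressed in epoch seconds is about a thousand times smaller than any publishTimestampInMillis value, so the filter would let every record through.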

Ajay-Pandey
Esteemed Contributor III

Hi @Prabakar  

Thanks for the quick reply. I am looking for the record count directly in Pub/Sub, not in Databricks, because we have to verify how many records were in Pub/Sub and how many records we received in Databricks over the last 24 hours.

Ajay Kumar Pandey

Anonymous
Not applicable

Hi @Ajay-Pandey 

Hope you are well. Just wanted to check whether you were able to find an answer to your question. If so, would you like to mark an answer as best? It would be really helpful for the other members too.

Cheers!
