Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Google Pub/Sub and Delta Live Tables

Christian_C
New Contributor II

I am using Delta Live Tables and Pub/Sub to ingest messages from 30 different topics in parallel.
I noticed that the initialization time can be very long, around 15 minutes.
Does someone know how to reduce the initialization time in DLT?

Thank you


7 REPLIES

aayrm5
Honored Contributor

Hey @Christian_C 

Typically, it'd take less than 5 minutes for the DLT cluster to spin up (if not serverless), and it would stay active for up to 120 minutes if you're in development mode.

Serverless DLT has faster initialization times than classic (non-serverless) compute. Check this out - https://docs.databricks.com/aws/en/dlt/configure-compute
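For reference, here is roughly how the two options differ in the pipeline settings JSON, sketched as Python dicts. Field names follow the public Pipelines API; the values are placeholders:

# Classic compute: you specify (and wait for) a cluster.
classic_settings = {
    "name": "pubsub-ingest",
    "clusters": [
        {
            "label": "default",
            "autoscale": {"min_workers": 1, "max_workers": 4},
        }
    ],
}

# Serverless compute: no cluster spec needed; startup is much faster.
serverless_settings = {
    "name": "pubsub-ingest",
    "serverless": True,
}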

Hope this helps!

Riz

Christian_C
New Contributor II

Hi @aayrm5 

Are you using Azure Databricks?

I am using Google Pub/Sub and GCP Databricks.
Serverless is still not available in GCP, at least not in my region.
I think my issue is related to Google Pub/Sub, which cannot initialize its connection with DLT in time.
Thank you

Christian

aayrm5
Honored Contributor

Hi @Christian_C,

I've used Databricks on both Azure & AWS. I assumed the Databricks services were the same across all the cloud providers. Thank you for letting me know that serverless is still not available on GCP.

Best!

Riz

BigRoux
Databricks Employee

Classic clusters can take up to seven minutes to be acquired, configured, and deployed, with most of this time spent waiting for the cloud service to allocate virtual machines. In contrast, serverless clusters typically start in under eight seconds. I recommend testing serverless clusters in a development environment to observe the performance difference and to ensure your specific code is compatible with serverless deployments.
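If you want to see exactly where the time goes, you can pull the pipeline's event log. A rough sketch against the public list-pipeline-events endpoint (workspace host and pipeline ID are placeholders; assumes a personal access token in DATABRICKS_TOKEN):

import os
import requests

host = "https://<your-workspace>.gcp.databricks.com"  # placeholder
pipeline_id = "<pipeline-id>"                         # placeholder

resp = requests.get(
    f"{host}/api/2.0/pipelines/{pipeline_id}/events",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    params={"max_results": 100},
)
# Each event carries a timestamp, an event_type, and a human-readable message,
# so you can see how long cluster creation vs. flow initialization took.
for event in resp.json().get("events", []):
    print(event.get("timestamp"), event.get("event_type"), event.get("message"))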


Let me know if you have any questions.
— Louis

Christian_C
New Contributor II

Hello Louis,

As I mentioned, GCP serverless is not available in my region (europe-west1):
https://docs.databricks.com/gcp/en/resources/feature-region-support

[screenshot: serverless feature/region support table]

My performance issue is not related to cluster startup delay.

It is related to the initialization of DLT's Google Pub/Sub subscriptions.

Thank you


More information is needed for effective troubleshooting 😉

How did you establish that the issue is not the cluster start-up time but delays in a Pub/Sub subscription?

What is your ingestion schedule?

What are your Pub/Sub connector options?

Please share your code that configures a read from Pub/Sub, if you can.

Have you checked out the streaming metrics?

Are there any errors or warnings related to the pipeline or Pub/Sub in Google Cloud Logging?

GCP Audit Logs for Pub/Sub can be configured to log timestamps of read operations, which can be cross-correlated with Databricks logs if need be.
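If you have the google-cloud-logging client available, a query along these lines can pull recent Pub/Sub read activity. A sketch, assuming Data Access audit logs are enabled for Pub/Sub and a placeholder project ID:

from google.cloud import logging as gcl

client = gcl.Client(project="<your-project>")  # placeholder
# StreamingPull is the gRPC method subscriber clients typically use to read messages.
log_filter = (
    'resource.type="pubsub_subscription" '
    'AND protoPayload.methodName="google.pubsub.v1.Subscriber.StreamingPull"'
)
for entry in client.list_entries(filter_=log_filter, order_by=gcl.DESCENDING, max_results=20):
    print(entry.timestamp, entry.resource.labels.get("subscription_id"))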

---

The last time I ran a DLT pipeline on a schedule on my GCP infra, it took about 15 minutes to provision a Databricks compute cluster (a GKE NodePool, effectively).

I hope this helps!

Hi Oliver,

See the screenshot below.

First the cluster initializes (so startup is finished; the delay does not come from cluster startup).
Then flows are appended to the pipeline (around 30 flows).
These flows take time to initialize, and I believe there is a maximum of around 10 concurrent flows. Most of the flows are empty, with no messages in the queue, so I was expecting very fast processing times, but that is not the case.

Due to low volume (I am at the development stage, not in production), I cannot see any metrics.

[screenshot: pipeline run showing flow initialization]

The code is pasted below.


import dlt

# `spark`, `projectId`, `pipelines`, and `authOptions` come from elsewhere in the notebook.

dlt.create_streaming_table(
    "events",
    comment="events",
    schema=(
        "messageId STRING NOT NULL, payload BINARY NOT NULL, "
        "attributes STRING NOT NULL, publishTimestampInMillis BIGINT NOT NULL"
    ),
)

def append_flow_to_table(topic, authOpts):
    subscriptionId = f"sub_{topic}"

    # One append flow per topic, all feeding the shared "events" table.
    @dlt.append_flow(
        target="events",
        name=f"`{topic}`",
        comment=f"{topic} flow",
    )
    def flow():
        return (
            spark.readStream
            .format("pubsub")
            .option("subscriptionId", subscriptionId)
            .option("topicId", topic)
            .option("projectId", projectId)
            .option("deleteSubscriptionOnStreamStop", "false")
            .options(**authOpts)
            .load()
        )

for pipeline in pipelines:
    print(pipeline["topic"])
    append_flow_to_table(pipeline["topic"], authOptions)


# List the streaming queries currently running.
activeStreams = [q.name for q in spark.streams.active]
print("active streams: ")
print(activeStreams)
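To test whether the delay is on the Pub/Sub side independently of DLT, I may also try timing the subscription lookups directly. A rough sketch, reusing the same pipelines list and projectId as in the code above (get_subscription is the standard google-cloud-pubsub client call; this only measures control-plane latency, not streaming attach time):

import time
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
for p in pipelines:
    sub_path = subscriber.subscription_path(projectId, f"sub_{p['topic']}")
    t0 = time.monotonic()
    subscriber.get_subscription(subscription=sub_path)  # raises NotFound if the subscription is missing
    print(f"{sub_path}: {time.monotonic() - t0:.2f}s")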


I will try to look at the Google Cloud logs on the next run.

I run this Delta Live Tables pipeline once per day.

Thank you
