Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Google Pub/Sub and Delta Live Tables

Christian_C
New Contributor II

I am using Delta Live Tables and Pub/Sub to ingest messages from 30 different topics in parallel.
I noticed that the initialization time can be very long, around 15 minutes.
Does someone know how to reduce the initialization time in DLT?

Thank you


7 REPLIES

aayrm5
Honored Contributor

Hey @Christian_C 

Typically, it'd take less than 5 minutes for the DLT cluster to spin up (if not serverless), and it would stay active for up to 120 minutes if you're in development mode.

Serverless DLT has faster initialization times than classic (non-serverless) compute. Check this out - https://docs.databricks.com/aws/en/dlt/configure-compute
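For reference, here is roughly how the two options differ in the pipeline settings JSON, sketched as Python dicts. Field names follow the public Pipelines API; the values are placeholders:

# Classic compute: you specify (and wait for) a cluster.
classic_settings = {
    "name": "pubsub-ingest",
    "clusters": [
        {
            "label": "default",
            "autoscale": {"min_workers": 1, "max_workers": 4},
        }
    ],
}

# Serverless compute: no cluster spec needed; startup is much faster.
serverless_settings = {
    "name": "pubsub-ingest",
    "serverless": True,
}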

Hope this helps!

Riz

Christian_C
New Contributor II

Hi @aayrm5 

Are you using Azure Databricks?

I am using Google Pub/Sub and GCP Databricks.
Serverless is still not available in GCP, at least not in my region.
I think my issue is related to Google Pub/Sub, which cannot initialize its connection with DLT in time.
Thank you

Christian

aayrm5
Honored Contributor

Hi @Christian_C,

I've used Databricks on both Azure & AWS. I assumed the Databricks services were the same across all the cloud providers. Thank you for letting me know that serverless is still not available on GCP.

Best!

Riz

BigRoux
Databricks Employee

Classic clusters can take up to seven minutes to be acquired, configured, and deployed, with most of this time spent waiting for the cloud service to allocate virtual machines. In contrast, serverless clusters typically start in under eight seconds. I recommend testing serverless clusters in a development environment to observe the performance difference and to ensure your specific code is compatible with serverless deployments.
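If you want to see exactly where the time goes, you can pull the pipeline's event log. A rough sketch against the public list-pipeline-events endpoint (workspace host and pipeline ID are placeholders; assumes a personal access token in DATABRICKS_TOKEN):

import os
import requests

host = "https://<your-workspace>.gcp.databricks.com"  # placeholder
pipeline_id = "<pipeline-id>"                         # placeholder

resp = requests.get(
    f"{host}/api/2.0/pipelines/{pipeline_id}/events",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    params={"max_results": 100},
)
# Each event carries a timestamp, an event_type, and a human-readable message,
# so you can see how long cluster creation vs. flow initialization took.
for event in resp.json().get("events", []):
    print(event.get("timestamp"), event.get("event_type"), event.get("message"))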


Let me know if you have any questions.
— Louis

Christian_C
New Contributor II

Hello Louis,

As I mentioned, GCP serverless is not available in my region (europe-west1):
https://docs.databricks.com/gcp/en/resources/feature-region-support

[screenshot: serverless feature/region support table]

My performance issue is not related to cluster startup delay.

It is related to the initialization of DLT's Google Pub/Sub subscriptions.

Thank you


More information is needed for effective troubleshooting 😉

How did you establish that the issue is not the cluster start-up time but delays in a Pub/Sub subscription?

What is your ingestion schedule?

What are your Pub/Sub connector options?

Please share your code that configures a read from Pub/Sub, if you can.

Have you checked out the streaming metrics?

Are there any errors or warnings related to the pipeline or Pub/Sub in Google Cloud Logging?

GCP Audit Logs for Pub/Sub can be configured to log timestamps of read operations, which can be cross-correlated with Databricks logs if need be.
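If you have the google-cloud-logging client available, a query along these lines can pull recent Pub/Sub read activity. A sketch, assuming Data Access audit logs are enabled for Pub/Sub and a placeholder project ID:

from google.cloud import logging as gcl

client = gcl.Client(project="<your-project>")  # placeholder
# StreamingPull is the gRPC method subscriber clients typically use to read messages.
log_filter = (
    'resource.type="pubsub_subscription" '
    'AND protoPayload.methodName="google.pubsub.v1.Subscriber.StreamingPull"'
)
for entry in client.list_entries(filter_=log_filter, order_by=gcl.DESCENDING, max_results=20):
    print(entry.timestamp, entry.resource.labels.get("subscription_id"))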

---

The last time I ran a DLT pipeline on a schedule on my GCP infra, it took about 15 minutes to provision a Databricks compute cluster (a GKE NodePool, effectively).

I hope this helps!

Hi Oliver,

See the screenshot below.

First the cluster initializes (so startup is finished; the delay does not come from cluster startup).
Then flows are appended to the pipeline (around 30 flows).
These flows take time to initialize, and I believe there is a maximum of around 10 concurrent flows. Most of the flows are empty, with no messages in the queue, so I was expecting very fast processing times, but that is not the case.

Due to low volume (I am at the development stage, not in production), I cannot see any metrics.

[screenshot: pipeline run showing flow initialization]

The code is pasted below.


import dlt

# `spark`, `projectId`, `pipelines`, and `authOptions` come from elsewhere in the notebook.

dlt.create_streaming_table(
    "events",
    comment="events",
    schema=(
        "messageId STRING NOT NULL, payload BINARY NOT NULL, "
        "attributes STRING NOT NULL, publishTimestampInMillis BIGINT NOT NULL"
    ),
)

def append_flow_to_table(topic, authOpts):
    subscriptionId = f"sub_{topic}"

    # One append flow per topic, all feeding the shared "events" table.
    @dlt.append_flow(
        target="events",
        name=f"`{topic}`",
        comment=f"{topic} flow",
    )
    def flow():
        return (
            spark.readStream
            .format("pubsub")
            .option("subscriptionId", subscriptionId)
            .option("topicId", topic)
            .option("projectId", projectId)
            .option("deleteSubscriptionOnStreamStop", "false")
            .options(**authOpts)
            .load()
        )

for pipeline in pipelines:
    print(pipeline["topic"])
    append_flow_to_table(pipeline["topic"], authOptions)


# List the streaming queries currently running.
activeStreams = [q.name for q in spark.streams.active]
print("active streams: ")
print(activeStreams)
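To test whether the delay is on the Pub/Sub side independently of DLT, I may also try timing the subscription lookups directly. A rough sketch, reusing the same pipelines list and projectId as in the code above (get_subscription is the standard google-cloud-pubsub client call; this only measures control-plane latency, not streaming attach time):

import time
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
for p in pipelines:
    sub_path = subscriber.subscription_path(projectId, f"sub_{p['topic']}")
    t0 = time.monotonic()
    subscriber.get_subscription(subscription=sub_path)  # raises NotFound if the subscription is missing
    print(f"{sub_path}: {time.monotonic() - t0:.2f}s")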


I will try to look at the Google Cloud logs on the next run.

I run this Delta Live Tables pipeline once per day.

Thank you
