Databricks Community

mnissen1337 · ‎05-19-2026

I’ve built an ingestion pipeline in Databricks consisting of two notebooks:

The first notebook calls an external API every four minutes to retrieve the latest available data.
- Each API call returns approximately 109 rows.
- The API only exposes the most recently captured dataset. For example:
  - A call at 12:50 returns data captured at 12:45
  - A call at 12:55 returns data captured at 12:50
- Because of this, the pipeline needs to execute with minimal delay; otherwise the data may no longer be available.
The second notebook performs a MERGE into the Silver layer using a business key plus timestamp to ensure idempotency.
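
For readers following along, the idempotent MERGE described above could look roughly like the sketch below. It only builds the SQL text, so it runs anywhere. The table, view, and column names (`silver.readings`, `bronze_batch`, `station_id`, `captured_at`) are hypothetical placeholders, not names from the actual pipeline.

```python
def build_silver_merge(target: str, source: str, business_key: str, ts_col: str) -> str:
    """Build a MERGE statement keyed on business key + capture timestamp.

    Matching on both columns means that re-merging a snapshot that was
    already processed updates the existing rows instead of inserting
    duplicates.
    """
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{business_key} = s.{business_key} AND t.{ts_col} = s.{ts_col}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical names, for illustration only.
merge_sql = build_silver_merge(
    target="silver.readings",
    source="bronze_batch",
    business_key="station_id",
    ts_col="captured_at",
)
print(merge_sql)

# In a Databricks notebook the statement would then be executed with:
# spark.sql(merge_sql)
```

One thing to watch: if the source contains several rows with the same business key and timestamp, the MERGE can fail on ambiguous matches, so it is probably worth deduplicating the source on those two columns first.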

My main challenge is deciding which compute option is most appropriate for this workload.

An all-purpose cluster running 24/7 would work, but it seems unnecessarily expensive for such a lightweight workload.
Job clusters are cheaper, but the startup latency makes them difficult to use in this scenario since the ingestion needs to happen very close to when the data becomes available.
I’m considering serverless compute as a potential middle ground because of the faster startup times.

However, there is one complication:

The notebooks depend on an internal Python package distributed as a wheel.
With standard job clusters, I can define this as a cluster/job library dependency in the associated DAB job configuration.
With serverless compute, it seems I would need to install it using %pip install, which feels less ideal and less declarative.

My questions are:

Would serverless compute be the best option for this kind of near-real-time ingestion workload?
Are there better architectural patterns for handling frequent API polling with low latency in Databricks?

Thanks in advance for any guidance!

szymon_dybczak · ‎05-19-2026

Hi @mnissen1337 ,

I would use serverless for that use case. It takes time for a job cluster to spin up (of course you can use pools, but given that your job needs to run every 5 minutes it doesn't make much sense), so serverless seems to be a great fit.

Regarding your concern about installing packages every time: serverless compute lets you define a custom environment that can be reused. What's great is that environments cache installed packages, which reduces startup latency for subsequent runs.

If my answer was helpful, please consider marking it as accepted solution.

Configure the serverless environment | Databricks on AWS

mnissen1337 · ‎05-19-2026

Thanks, I will look into that!

Would it still make sense to separate it into two notebooks (one for ingestion into the bronze layer and one for merging into the silver layer to ensure idempotency), or would you just include everything in the same notebook?

szymon_dybczak · ‎05-19-2026

Hi @mnissen1337 ,

I would keep them separate. With a single notebook you lose the ability to rerun just the silver merge independently - if the merge fails or produces bad data, you'd have to either rerun the full ingestion or add conditional logic to skip the bronze step, which gets messy fast.
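
To illustrate why the split is safe, here is a minimal in-memory sketch in plain Python (no Spark), assuming the silver key is (business key, timestamp). The function names and the `station_id` / `captured_at` / `value` fields are made up for the example. The list and dict stand in for the bronze and silver tables.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
SilverKey = Tuple[object, object]


def append_to_bronze(bronze: List[Row], snapshot: List[Row]) -> None:
    """Step 1 (ingestion notebook): land the raw API response unchanged."""
    bronze.extend(snapshot)


def merge_to_silver(bronze: List[Row], silver: Dict[SilverKey, Row]) -> int:
    """Step 2 (merge notebook): upsert keyed on (business key, timestamp).

    Returns the number of keys that were new to silver.
    """
    inserted = 0
    for row in bronze:
        key = (row["station_id"], row["captured_at"])
        if key not in silver:
            inserted += 1
        silver[key] = row  # an existing key is overwritten, not duplicated
    return inserted


bronze: List[Row] = []
silver: Dict[SilverKey, Row] = {}

snapshot = [
    {"station_id": "A", "captured_at": "12:45", "value": 1.0},
    {"station_id": "B", "captured_at": "12:45", "value": 2.0},
]

append_to_bronze(bronze, snapshot)
first_run = merge_to_silver(bronze, silver)

# Rerun only the merge step, e.g. after a failure: nothing new is inserted.
rerun = merge_to_silver(bronze, silver)

# The same snapshot gets polled twice: bronze grows, silver does not.
append_to_bronze(bronze, snapshot)
after_duplicate_poll = merge_to_silver(bronze, silver)

print(first_run, rerun, after_duplicate_poll, len(silver))  # prints: 2 0 0 2
```

Because the merge step only reads what is already in bronze, it can be rerun on its own without calling the API again, which matters here since the API no longer serves older snapshots.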

If my answer was helpful, please consider marking it as accepted solution.

Best Compute Option for Near-Real-Time Databricks API Ingestion Pipeline
