I’ve built an ingestion pipeline in Databricks consisting of two notebooks:
- The first notebook calls an external API every four minutes to retrieve the latest available data.
- Each API call returns approximately 109 rows.
- The API only exposes the most recently captured dataset. For example:
  - A call at 12:50 returns data captured at 12:45
  - A call at 12:55 returns data captured at 12:50
- Because of this, the pipeline needs to execute with minimal delay; otherwise, the data may no longer be available.
- The second notebook performs a MERGE into the Silver layer using a business key plus timestamp to ensure idempotency.
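
For illustration, a merge of this kind might look roughly like the sketch below. The table, view, and column names (`silver.readings`, `bronze_batch`, `station_id`, `captured_at`) are hypothetical placeholders, and the insert-only form is an assumption, not the exact statement used in the notebook.

```python
def build_merge_sql(target: str, source: str, key_cols: list[str], ts_col: str) -> str:
    """Build an insert-only Delta MERGE keyed on business key + timestamp.

    Re-running the same batch matches every existing row and inserts
    nothing, which is what makes the load idempotent.
    """
    match_cols = [*key_cols, ts_col]
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in match_cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical names, for illustration only.
merge_sql = build_merge_sql(
    target="silver.readings",
    source="bronze_batch",
    key_cols=["station_id"],
    ts_col="captured_at",
)
print(merge_sql)
# In a notebook this string would be passed to spark.sql(merge_sql).
```

If existing rows should also be refreshed on a re-run, a `WHEN MATCHED THEN UPDATE SET *` clause could be added before the insert clause.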
My main challenge is deciding which compute option is most appropriate for this workload.
- An all-purpose cluster running 24/7 would work, but it seems unnecessarily expensive for such a lightweight workload.
- Job clusters are cheaper, but the startup latency makes them difficult to use in this scenario since the ingestion needs to happen very close to when the data becomes available.
- I’m considering serverless compute as a potential middle ground because of the faster startup times.
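
To make the timing constraint concrete, here is a rough sketch. It assumes that snapshots are captured every five minutes (as in the 12:45 / 12:50 example) and that each snapshot stays retrievable for about one capture interval. The startup latencies are invented purely for illustration and are not measurements.

```python
from datetime import timedelta

# Assumptions (not measured): a new snapshot every 5 minutes, each visible
# for roughly one capture interval; the job is triggered every 4 minutes.
CAPTURE_INTERVAL = timedelta(minutes=5)
POLL_INTERVAL = timedelta(minutes=4)


def call_gap(poll_interval, prev_startup, next_startup):
    """Time between two consecutive API calls, given each run's startup latency."""
    return poll_interval + (next_startup - prev_startup)


def gap_is_safe(gap, capture_interval=CAPTURE_INTERVAL):
    """A gap no longer than one capture interval cannot skip a snapshot."""
    return gap <= capture_interval


# Hypothetical startup latencies for two consecutive runs.
slow = call_gap(POLL_INTERVAL, timedelta(seconds=180), timedelta(seconds=270))
fast = call_gap(POLL_INTERVAL, timedelta(seconds=10), timedelta(seconds=25))

print(slow, gap_is_safe(slow))  # 0:05:30 False
print(fast, gap_is_safe(fast))  # 0:04:15 True
```

Under these assumptions the margin is one minute: what matters is less the average startup time than how much it varies from run to run, since a variation above that margin can let a snapshot slip by.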
However, there is one complication:
- The notebooks depend on an internal Python package distributed as a wheel.
- With standard job clusters, I can declare this as a library dependency in the job definition of the associated Databricks Asset Bundle (DAB).
- With serverless compute, it seems I would need to install it with `%pip install` inside the notebook, which feels less declarative and harder to manage.
My questions are:
- Would serverless compute be the best option for this kind of near-real-time ingestion workload?
- Are there better architectural patterns for handling frequent API polling with low latency in Databricks?
Thanks in advance for any guidance!