<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Best Compute Option for Near-Real-Time Databricks API Ingestion Pipeline in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-compute-option-for-near-real-time-databricks-api-ingestion/m-p/157224#M54517</link>
    <description>&lt;P&gt;I’ve built an ingestion pipeline in Databricks consisting of two notebooks:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;The first notebook calls an external API every five minutes to retrieve the latest available data.&lt;UL&gt;&lt;LI&gt;Each API call returns approximately 109 rows.&lt;/LI&gt;&lt;LI&gt;The API only exposes the &lt;EM&gt;most recently captured&lt;/EM&gt; dataset. For example:&lt;UL&gt;&lt;LI&gt;A call at 12:50 returns data captured at 12:45&lt;/LI&gt;&lt;LI&gt;A call at 12:55 returns data captured at 12:50&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;Because of this, the pipeline needs to execute with minimal delay; otherwise the data may no longer be available.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;The second notebook performs a MERGE into the Silver layer using a business key plus timestamp to ensure idempotency.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;My main challenge is deciding which compute option is most appropriate for this workload.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;An all-purpose cluster running 24/7 would work, but it seems unnecessarily expensive for such a lightweight workload.&lt;/LI&gt;&lt;LI&gt;Job clusters are cheaper, but the startup latency makes them difficult to use in this scenario since the ingestion needs to happen very close to when the data becomes available.&lt;/LI&gt;&lt;LI&gt;I’m considering serverless compute as a potential middle ground because of the faster startup times.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;However, there is one complication:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The notebooks depend on an internal Python package distributed as a wheel.&lt;/LI&gt;&lt;LI&gt;With standard job clusters, I can define this as a cluster/job library dependency in the associated DAB job configuration.&lt;/LI&gt;&lt;LI&gt;With serverless compute, it seems I would need to install it using %pip install, which feels less ideal and less declarative.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;My questions are:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Would serverless compute be the best option for this kind of near-real-time ingestion workload?&lt;/LI&gt;&lt;LI&gt;Are there better architectural patterns for handling frequent API polling with low latency in Databricks?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thanks in advance for any guidance!&lt;/P&gt;</description>
    <pubDate>Tue, 19 May 2026 07:38:44 GMT</pubDate>
    <dc:creator>mnissen1337</dc:creator>
    <dc:date>2026-05-19T07:38:44Z</dc:date>
    <item>
      <title>Best Compute Option for Near-Real-Time Databricks API Ingestion Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/best-compute-option-for-near-real-time-databricks-api-ingestion/m-p/157224#M54517</link>
      <description>&lt;P&gt;I’ve built an ingestion pipeline in Databricks consisting of two notebooks:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;The first notebook calls an external API every five minutes to retrieve the latest available data.&lt;UL&gt;&lt;LI&gt;Each API call returns approximately 109 rows.&lt;/LI&gt;&lt;LI&gt;The API only exposes the &lt;EM&gt;most recently captured&lt;/EM&gt; dataset. For example:&lt;UL&gt;&lt;LI&gt;A call at 12:50 returns data captured at 12:45&lt;/LI&gt;&lt;LI&gt;A call at 12:55 returns data captured at 12:50&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;Because of this, the pipeline needs to execute with minimal delay; otherwise the data may no longer be available.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;The second notebook performs a MERGE into the Silver layer using a business key plus timestamp to ensure idempotency.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;My main challenge is deciding which compute option is most appropriate for this workload.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;An all-purpose cluster running 24/7 would work, but it seems unnecessarily expensive for such a lightweight workload.&lt;/LI&gt;&lt;LI&gt;Job clusters are cheaper, but the startup latency makes them difficult to use in this scenario since the ingestion needs to happen very close to when the data becomes available.&lt;/LI&gt;&lt;LI&gt;I’m considering serverless compute as a potential middle ground because of the faster startup times.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;However, there is one complication:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The notebooks depend on an internal Python package distributed as a wheel.&lt;/LI&gt;&lt;LI&gt;With standard job clusters, I can define this as a cluster/job library dependency in the associated DAB job configuration.&lt;/LI&gt;&lt;LI&gt;With serverless compute, it seems I would need to install it using %pip install, which feels less ideal and less declarative.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;My questions are:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Would serverless compute be the best option for this kind of near-real-time ingestion workload?&lt;/LI&gt;&lt;LI&gt;Are there better architectural patterns for handling frequent API polling with low latency in Databricks?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thanks in advance for any guidance!&lt;/P&gt;</description>
      <pubDate>Tue, 19 May 2026 07:38:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-compute-option-for-near-real-time-databricks-api-ingestion/m-p/157224#M54517</guid>
      <dc:creator>mnissen1337</dc:creator>
      <dc:date>2026-05-19T07:38:44Z</dc:date>
    </item>
    <item>
      <title>Re: Best Compute Option for Near-Real-Time Databricks API Ingestion Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/best-compute-option-for-near-real-time-databricks-api-ingestion/m-p/157227#M54518</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/229258"&gt;@mnissen1337&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I would use serverless for that use case. A job cluster takes time to spin up (you could use pools, but given that your job needs to run every 5 minutes, that doesn't make much sense), so serverless seems to be a great fit.&lt;/P&gt;&lt;P&gt;Regarding your concern about installing packages on every run: serverless compute lets you define a custom environment that can be reused. What's great is that &lt;SPAN&gt;environments cache installed packages, which reduces startup latency for subsequent runs.&lt;BR /&gt;&lt;BR /&gt;If my answer was helpful, please consider marking it as the accepted solution.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/compute/serverless/dependencies#create-a-custom-environment-specification" target="_blank"&gt;Configure the serverless environment | Databricks on AWS&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 May 2026 08:05:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-compute-option-for-near-real-time-databricks-api-ingestion/m-p/157227#M54518</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2026-05-19T08:05:41Z</dc:date>
    </item>
    <item>
      <title>Re: Best Compute Option for Near-Real-Time Databricks API Ingestion Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/best-compute-option-for-near-real-time-databricks-api-ingestion/m-p/157228#M54519</link>
      <description>&lt;P&gt;Thanks, I will look into that!&lt;BR /&gt;&lt;BR /&gt;Would it still make sense to split it into two notebooks (one for ingestion, loading into the bronze layer, and one for merging into the silver layer to ensure idempotency), or would you just put everything in the same notebook?&lt;/P&gt;</description>
      <pubDate>Tue, 19 May 2026 08:13:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-compute-option-for-near-real-time-databricks-api-ingestion/m-p/157228#M54519</guid>
      <dc:creator>mnissen1337</dc:creator>
      <dc:date>2026-05-19T08:13:37Z</dc:date>
    </item>
    <item>
      <title>Re: Best Compute Option for Near-Real-Time Databricks API Ingestion Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/best-compute-option-for-near-real-time-databricks-api-ingestion/m-p/157233#M54521</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/229258"&gt;@mnissen1337&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I would keep them separate. With a single notebook you lose the ability to rerun just the silver merge independently. If the merge fails or produces bad data, you'd have to either rerun the full ingestion or add conditional logic to skip the bronze step, which gets messy fast.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;If my answer was helpful, please consider marking it as the accepted solution.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 May 2026 09:15:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-compute-option-for-near-real-time-databricks-api-ingestion/m-p/157233#M54521</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2026-05-19T09:15:06Z</dc:date>
    </item>
  </channel>
</rss>

