<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Databricks Streaming: Recommended Cluster Types and Best Practices in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-streaming-recommended-cluster-types-and-best/m-p/137520#M50754</link>
    <description>&lt;P&gt;Hi Community, I recently built some streaming pipelines (Auto Loader-based) that extract JSON data from the Data Lake and, after parsing and logging, dump it into the Delta Lake bronze layer. Since these are streaming pipelines, they are supposed to run indefinitely until I deliberately stop them. However, I’ve noticed that the Databricks clusters (All-Purpose Compute) tend to become unstable after a day or two of continuous execution.&lt;/P&gt;&lt;P&gt;To keep things running, I’ve currently implemented an optimizer job that’s scheduled daily to stop the cluster, restart it, and then re-trigger the streaming pipeline.&lt;/P&gt;&lt;P&gt;I feel this might not be a best practice. Could you please suggest which cluster types are most suitable for streaming jobs/pipelines and what the best practices are for managing streaming systems in Databricks?&lt;/P&gt;</description>
    <pubDate>Tue, 04 Nov 2025 08:38:22 GMT</pubDate>
    <dc:creator>FarhanM</dc:creator>
    <dc:date>2025-11-04T08:38:22Z</dc:date>
    <item>
      <title>Databricks Streaming: Recommended Cluster Types and Best Practices</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-streaming-recommended-cluster-types-and-best/m-p/137520#M50754</link>
      <description>&lt;P&gt;Hi Community, I recently built some streaming pipelines (Auto Loader-based) that extract JSON data from the Data Lake and, after parsing and logging, dump it into the Delta Lake bronze layer. Since these are streaming pipelines, they are supposed to run indefinitely until I deliberately stop them. However, I’ve noticed that the Databricks clusters (All-Purpose Compute) tend to become unstable after a day or two of continuous execution.&lt;/P&gt;&lt;P&gt;To keep things running, I’ve currently implemented an optimizer job that’s scheduled daily to stop the cluster, restart it, and then re-trigger the streaming pipeline.&lt;/P&gt;&lt;P&gt;I feel this might not be a best practice. Could you please suggest which cluster types are most suitable for streaming jobs/pipelines and what the best practices are for managing streaming systems in Databricks?&lt;/P&gt;</description>
      <pubDate>Tue, 04 Nov 2025 08:38:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-streaming-recommended-cluster-types-and-best/m-p/137520#M50754</guid>
      <dc:creator>FarhanM</dc:creator>
      <dc:date>2025-11-04T08:38:22Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Streaming: Recommended Cluster Types and Best Practices</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-streaming-recommended-cluster-types-and-best/m-p/137523#M50755</link>
      <description>&lt;P&gt;When running streaming pipelines, the key is to design for &lt;STRONG&gt;stability and isolation&lt;/STRONG&gt;, not to rely on restart jobs.&lt;/P&gt;&lt;P&gt;The first thing to do is &lt;STRONG&gt;run your streams on Jobs Compute, not All-Purpose clusters&lt;/STRONG&gt;. If available, &lt;STRONG&gt;use Serverless Jobs&lt;/STRONG&gt;. Each pipeline should have its own &lt;STRONG&gt;dedicated job cluster&lt;/STRONG&gt;, which gives you a clean, isolated runtime, consistent libraries, and automatic retries, all of which reduce drift and instability.&lt;/P&gt;&lt;P&gt;Choose a &lt;STRONG&gt;recent LTS Databricks Runtime with Photon&lt;/STRONG&gt; (e.g., 14.3 LTS or 15.4 LTS). Photon gives a real boost in JSON parsing and Delta writes. Enable &lt;STRONG&gt;autoscaling with a minimum of at least two workers&lt;/STRONG&gt; to prevent executor churn, and avoid min=0 for long-running streams.&lt;/P&gt;&lt;P&gt;You didn’t mention if you’re using &lt;STRONG&gt;Delta Live Tables (now called Lakeflow Declarative Pipelines)&lt;/STRONG&gt;, but that’s worth exploring. DLT automatically manages &lt;STRONG&gt;cluster lifecycles, recovery, data quality checks, autoscaling, and lineage&lt;/STRONG&gt;, all built in.&lt;/P&gt;&lt;P&gt;In short:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;Run your workloads through &lt;STRONG&gt;Workflows (Jobs)&lt;/STRONG&gt; using &lt;STRONG&gt;Job or Serverless clusters&lt;/STRONG&gt; with retries, autoscaling floors, proper &lt;STRONG&gt;checkpoints&lt;/STRONG&gt;, &lt;STRONG&gt;file-notification mode&lt;/STRONG&gt;, and &lt;STRONG&gt;monitoring&lt;/STRONG&gt;.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;There’s no need for a separate optimizer job to stop and restart clusters. Instead, follow good hygiene for &lt;STRONG&gt;checkpoints, file notifications, small files, and state management&lt;/STRONG&gt;. You’ll find detailed guidance in the Databricks documentation on &lt;STRONG&gt;streaming best practices&lt;/STRONG&gt; and &lt;STRONG&gt;Auto Loader performance tuning&lt;/STRONG&gt;.&lt;/P&gt;</description>
      <pubDate>Tue, 04 Nov 2025 08:56:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-streaming-recommended-cluster-types-and-best/m-p/137523#M50755</guid>
      <dc:creator>bianca_unifeye</dc:creator>
      <dc:date>2025-11-04T08:56:53Z</dc:date>
    </item>
  </channel>
</rss>

