<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Facing CANNOT_OPEN_SOCKET error after job cluster fails to upscale to target nodes in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140313#M51380</link>
    <description>&lt;P&gt;Difficult to know for sure, but it may have to do with the use of &lt;STRONG&gt;spot&lt;/STRONG&gt; instances, since the root cause seems somewhat random. Spot instances can be terminated at any time by the cloud provider if it needs the capacity back; Databricks should handle this by replacing lost spot workers or applying resilient policies, but this type of error can still slip through.&lt;/P&gt;&lt;P&gt;So I can't guarantee that this is your issue. However, you can try disabling that option for a while, keeping in mind that costs will be somewhat higher. In any case, don't use "spot" instances in PROD unless your workloads can tolerate interruptions.&lt;/P&gt;</description>
    <pubDate>Tue, 25 Nov 2025 13:31:30 GMT</pubDate>
    <dc:creator>Coffee77</dc:creator>
    <dc:date>2025-11-25T13:31:30Z</dc:date>
    <item>
      <title>Facing CANNOT_OPEN_SOCKET error after job cluster fails to upscale to target nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140311#M51379</link>
      <description>&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2025-11-25 at 6.08.19 PM.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21948i18D7368CA5AC4C25/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Screenshot 2025-11-25 at 6.08.19 PM.png" alt="Screenshot 2025-11-25 at 6.08.19 PM.png" /&gt;&lt;/span&gt;&lt;BR /&gt;This error pops up in my Databricks workflow 1 out of 10 times, and every time it occurs I see the message below in the event logs.&lt;BR /&gt;&lt;STRONG&gt;Compute upsize complete, but below target size. The current worker count is 1, out of a target of 3.&lt;/STRONG&gt;&lt;BR /&gt;And right after this my job cluster terminates with the socket error message.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2025-11-25 at 6.10.50 PM.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21949iF1F6BD9624DE7FC3/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Screenshot 2025-11-25 at 6.10.50 PM.png" alt="Screenshot 2025-11-25 at 6.10.50 PM.png" /&gt;&lt;/span&gt;&lt;BR /&gt;These are my cluster configs, if required.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2025-11-25 at 6.12.30 PM.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21950i3E0739CAB36E9294/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Screenshot 2025-11-25 at 6.12.30 PM.png" alt="Screenshot 2025-11-25 at 6.12.30 PM.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Nov 2025 12:43:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140311#M51379</guid>
      <dc:creator>ashishCh</dc:creator>
      <dc:date>2025-11-25T12:43:33Z</dc:date>
    </item>
    <item>
      <title>Re: Facing CANNOT_OPEN_SOCKET error after job cluster fails to upscale to target nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140313#M51380</link>
      <description>&lt;P&gt;Difficult to know for sure, but it may have to do with the use of &lt;STRONG&gt;spot&lt;/STRONG&gt; instances, since the root cause seems somewhat random. Spot instances can be terminated at any time by the cloud provider if it needs the capacity back; Databricks should handle this by replacing lost spot workers or applying resilient policies, but this type of error can still slip through.&lt;/P&gt;&lt;P&gt;So I can't guarantee that this is your issue. However, you can try disabling that option for a while, keeping in mind that costs will be somewhat higher. In any case, don't use "spot" instances in PROD unless your workloads can tolerate interruptions.&lt;/P&gt;
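&lt;P&gt;If you want to test that and you are on AWS, pinning workers to on-demand capacity looks roughly like this in a job cluster spec (an untested sketch with illustrative node type and runtime version; on Azure the equivalent is "azure_attributes" with "availability": "ON_DEMAND_AZURE"):&lt;/P&gt;&lt;PRE&gt;{
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": { "min_workers": 1, "max_workers": 3 },
    "aws_attributes": { "availability": "ON_DEMAND" }
  }
}&lt;/PRE&gt;&lt;P&gt;"SPOT_WITH_FALLBACK" is a middle ground: it keeps spot pricing but falls back to on-demand when spot capacity cannot be acquired.&lt;/P&gt;</description>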
      <pubDate>Tue, 25 Nov 2025 13:31:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140313#M51380</guid>
      <dc:creator>Coffee77</dc:creator>
      <dc:date>2025-11-25T13:31:30Z</dc:date>
    </item>
    <item>
      <title>Re: Facing CANNOT_OPEN_SOCKET error after job cluster fails to upscale to target nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140327#M51386</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/145827"&gt;@ashishCh&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The [CANNOT_OPEN_SOCKET] failures stem from PySpark’s default, socket‑based data transfer path used when collecting rows back to Python (e.g., .collect(), .first(), .take()), where the local handshake to a JVM‑opened ephemeral port on 127.0.0.1 intermittently times out or is refused.&amp;nbsp;&lt;/P&gt;
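&lt;P&gt;Concretely, any action that pulls rows into the driver's Python process goes through that local socket. A minimal PySpark illustration (illustrative only; any such action can surface the error when the handshake fails):&lt;/P&gt;&lt;PRE&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000)

rows = df.take(10)         # JVM opens an ephemeral local port; Python connects to stream rows back
first = df.first()         # same socket-based transfer path
everything = df.collect()  # largest transfer, most likely to hit the timeout/refusal&lt;/PRE&gt;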
&lt;P&gt;These failures can happen due to spot instance termination, executor unresponsiveness under memory/CPU pressure, and similar disruptions.&lt;/P&gt;
&lt;P&gt;To mitigate this error, you can add the following Spark configuration to your job compute clusters:&lt;/P&gt;&lt;PRE&gt;spark.databricks.pyspark.useFileBasedCollect true&lt;/PRE&gt;&lt;P&gt;This switches the data transfer mechanism from sockets to temporary files, thereby avoiding reliance on the local network layer.&lt;/P&gt;
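&lt;P&gt;For reference, in a Jobs API job cluster definition that setting lives under spark_conf (a partial sketch; merge the key into any existing spark_conf rather than replacing it):&lt;/P&gt;&lt;PRE&gt;{
  "new_cluster": {
    "spark_conf": {
      "spark.databricks.pyspark.useFileBasedCollect": "true"
    }
  }
}&lt;/PRE&gt;&lt;P&gt;In the cluster UI, the same line goes under Advanced options &gt; Spark &gt; Spark config.&lt;/P&gt;</description>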
      <pubDate>Tue, 25 Nov 2025 18:17:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140327#M51386</guid>
      <dc:creator>iyashk-DB</dc:creator>
      <dc:date>2025-11-25T18:17:11Z</dc:date>
    </item>
  </channel>
</rss>