<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Cluster configuration and optimal number for fs.s3a.connection.maximum, fs.s3a.threads.max in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/cluster-configuration-and-optimal-number-for-fs-s3a-connection/m-p/23346#M16095</link>
    <description>&lt;P&gt;Could you please suggest the best cluster configuration for the use case below, and tips for resolving the errors shown?&lt;/P&gt;&lt;P&gt;Use case:&lt;/P&gt;&lt;P&gt;There can be 4 or 5 Spark jobs running concurrently.&lt;/P&gt;&lt;P&gt;Each job reads 40 input files and writes 120 output files to S3 in CSV format (three times the number of input files).&lt;/P&gt;&lt;P&gt;All concurrent jobs read the same 39 input files; only one input file differs from job to job.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The jobs often fail with the following errors:&lt;/P&gt;&lt;P&gt;Job aborted due to stage failure: Task 0 in stage 3084.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3084.0 (TID...., ip..., executor 0): org.apache.spark.SparkException: Task failed while writing rows&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Job aborted due to stage failure: Task 0 in stage 3078.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3078.0 (TID...., ip..., executor 0): java.io.InterruptedIOException: getFileStatus on s3:&amp;lt;file path&amp;gt; : com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My Spark configuration is given below:&lt;/P&gt;&lt;P&gt;new SparkConf()&lt;/P&gt;&lt;P&gt;.set("spark.serializer", classOf[KryoSerializer].getName)&lt;/P&gt;&lt;P&gt;.set("spark.hadoop.fs.s3z.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")&lt;/P&gt;&lt;P&gt;.set("spark.hadoop.fs.s3a.connection.maximum", "400")&lt;/P&gt;&lt;P&gt;.set("fs.s3a.threads.max", "200")&lt;/P&gt;&lt;P&gt;.set("spark.hadoop.fs.s3a.fast.upload", "true")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The Environment section of the Spark UI shows:&lt;/P&gt;&lt;P&gt;spark.hadoop.fs.s3a.connection.maximum = 200&lt;/P&gt;&lt;P&gt;fs.s3a.threads.max = 136&lt;/P&gt;&lt;P&gt;These values do not match my settings.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Questions:&lt;/P&gt;&lt;P&gt;(1) What needs to be done to cache the input files so that subsequent concurrent jobs can reuse them? Would a storage-optimized cluster configuration with Delta cache do this?&lt;/P&gt;&lt;P&gt;(2) Why don't the values in the Spark UI Environment section match my Spark conf settings?&lt;/P&gt;&lt;P&gt;(3) How can these job errors be resolved?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Vee&lt;/P&gt;</description>
    <pubDate>Thu, 07 Apr 2022 18:37:05 GMT</pubDate>
    <dc:creator>Vee</dc:creator>
    <dc:date>2022-04-07T18:37:05Z</dc:date>
    <item>
      <title>Cluster configuration and optimal number for fs.s3a.connection.maximum, fs.s3a.threads.max</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-configuration-and-optimal-number-for-fs-s3a-connection/m-p/23346#M16095</link>
      <description>&lt;P&gt;Could you please suggest the best cluster configuration for the use case below, and tips for resolving the errors shown?&lt;/P&gt;&lt;P&gt;Use case:&lt;/P&gt;&lt;P&gt;There can be 4 or 5 Spark jobs running concurrently.&lt;/P&gt;&lt;P&gt;Each job reads 40 input files and writes 120 output files to S3 in CSV format (three times the number of input files).&lt;/P&gt;&lt;P&gt;All concurrent jobs read the same 39 input files; only one input file differs from job to job.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The jobs often fail with the following errors:&lt;/P&gt;&lt;P&gt;Job aborted due to stage failure: Task 0 in stage 3084.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3084.0 (TID...., ip..., executor 0): org.apache.spark.SparkException: Task failed while writing rows&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Job aborted due to stage failure: Task 0 in stage 3078.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3078.0 (TID...., ip..., executor 0): java.io.InterruptedIOException: getFileStatus on s3:&amp;lt;file path&amp;gt; : com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My Spark configuration is given below:&lt;/P&gt;&lt;P&gt;new SparkConf()&lt;/P&gt;&lt;P&gt;.set("spark.serializer", classOf[KryoSerializer].getName)&lt;/P&gt;&lt;P&gt;.set("spark.hadoop.fs.s3z.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")&lt;/P&gt;&lt;P&gt;.set("spark.hadoop.fs.s3a.connection.maximum", "400")&lt;/P&gt;&lt;P&gt;.set("fs.s3a.threads.max", "200")&lt;/P&gt;&lt;P&gt;.set("spark.hadoop.fs.s3a.fast.upload", "true")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The Environment section of the Spark UI shows:&lt;/P&gt;&lt;P&gt;spark.hadoop.fs.s3a.connection.maximum = 200&lt;/P&gt;&lt;P&gt;fs.s3a.threads.max = 136&lt;/P&gt;&lt;P&gt;These values do not match my settings.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Questions:&lt;/P&gt;&lt;P&gt;(1) What needs to be done to cache the input files so that subsequent concurrent jobs can reuse them? Would a storage-optimized cluster configuration with Delta cache do this?&lt;/P&gt;&lt;P&gt;(2) Why don't the values in the Spark UI Environment section match my Spark conf settings?&lt;/P&gt;&lt;P&gt;(3) How can these job errors be resolved?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Vee&lt;/P&gt;</description>
      <pubDate>Thu, 07 Apr 2022 18:37:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-configuration-and-optimal-number-for-fs-s3a-connection/m-p/23346#M16095</guid>
      <dc:creator>Vee</dc:creator>
      <dc:date>2022-04-07T18:37:05Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster configuration and optimal number for fs.s3a.connection.maximum, fs.s3a.threads.max</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-configuration-and-optimal-number-for-fs-s3a-connection/m-p/23347#M16096</link>
      <description>&lt;P&gt;Hi @Vetrivel Senthil,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just wondering whether this question is a duplicate of this one: &lt;A href="https://community.databricks.com/s/feed/0D53f00001qvQJcCAM" target="test_blank"&gt;https://community.databricks.com/s/feed/0D53f00001qvQJcCAM&lt;/A&gt;?&lt;/P&gt;</description>
      <pubDate>Fri, 29 Apr 2022 22:09:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-configuration-and-optimal-number-for-fs-s3a-connection/m-p/23347#M16096</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-04-29T22:09:48Z</dc:date>
    </item>
  </channel>
</rss>

