<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to fetch spark.addFiles when used multi node cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-fetch-spark-addfiles-when-used-multi-node-cluster/m-p/132576#M49558</link>
    <description>&lt;P&gt;Currently, the job reads these XML files from DBFS for every batch, which makes it a bit slow. Rather than reading them from DBFS each time, I want to read them from executor memory.&lt;/P&gt;</description>
    <pubDate>Fri, 19 Sep 2025 15:33:58 GMT</pubDate>
    <dc:creator>Mahesh_rathi__</dc:creator>
    <dc:date>2025-09-19T15:33:58Z</dc:date>
    <item>
      <title>How to fetch spark.addFiles when used multi node cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-fetch-spark-addfiles-when-used-multi-node-cluster/m-p/132568#M49553</link>
      <description>&lt;P&gt;I want to share about 12 XML files from a DBFS location to each executor's local path using sc.addFile. Following your blog, I tweaked my code to build the path with file:///. This worked with a single node, but it threw an error when the cluster had multiple nodes, even though I'm using SparkFiles.get to fetch the path.&lt;/P&gt;&lt;P&gt;Code:&lt;BR /&gt;from pyspark import SparkFiles&lt;BR /&gt;patten_pool_path = "dbfs:/FileStore/Mrathi/pattern-pool"&lt;BR /&gt;sc.addFile(patten_pool_path, recursive=True)&lt;BR /&gt;patten_pool_path = SparkFiles.get("pattern-pool")&lt;BR /&gt;full_path = "file://" + patten_pool_path + "/"&lt;BR /&gt;print(full_path) #output == file:///local_disk0/spark-7f135ab4-231f-4649-8571-f375f8ac738f/userFiles-de1552f4-8e94-4625-a2ca-e2…&lt;BR /&gt;rdd = sc.textFile(full_path)&lt;BR /&gt;head_rdd = rdd.pipe("head -n 5")&lt;BR /&gt;print(head_rdd.collect())&lt;/P&gt;&lt;P&gt;Output:&lt;BR /&gt;org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 16.0 failed 4 times, most recent failure: Lost task 5.3 in stage 16.0 (TID 278) (10.139.64.102 executor 0): java.io.FileNotFoundException: File file:/local_disk0/spark-7f135ab4-231f-4649-8571-f375f8ac738f/userFiles-de1552f4-8e94-4625-a2ca-e21ad2467b63/pattern-pool/Imaging_Measurement_Started_SubPattern_V1.xml does not exist&lt;/P&gt;</description>
      <pubDate>Fri, 19 Sep 2025 14:07:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-fetch-spark-addfiles-when-used-multi-node-cluster/m-p/132568#M49553</guid>
      <dc:creator>Mahesh_rathi__</dc:creator>
      <dc:date>2025-09-19T14:07:42Z</dc:date>
    </item>
    <item>
      <title>Re: How to fetch spark.addFiles when used multi node cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-fetch-spark-addfiles-when-used-multi-node-cluster/m-p/132574#M49557</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/185621"&gt;@Mahesh_rathi__&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;SparkContext.addFile&lt;/STRONG&gt;&lt;/EM&gt; is for shipping small side files to executors, not for creating an input path that you can pass to sc.textFile("file://...").&lt;/P&gt;
&lt;P&gt;On a single-node cluster the driver and executor share the same machine, so the driver’s local path “happens to work.” In a multi-node cluster each executor has its own userFiles-&amp;lt;uuid&amp;gt; directory, so the driver-computed file:///local_disk0/... path won’t exist on the other nodes, hence the &lt;STRONG&gt;FileNotFoundException&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;You can skip addFile and the file:// scheme entirely. Read from DBFS directly, and Spark will parallelise it across executors. Does this not work for you?&lt;/P&gt;
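&lt;P&gt;A minimal sketch of reading straight from DBFS, assuming a Databricks notebook where sc is predefined and the XML files sit directly under the folder from your post. read_pattern_pool is a hypothetical helper name, not a Spark API:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;def read_pattern_pool(sc, xml_dir="dbfs:/FileStore/Mrathi/pattern-pool"):
    """Return an RDD of (file_path, file_content) pairs, one pair per XML file."""
    # The executors read from DBFS themselves, so no driver-local path is involved.
    # Assumption: only the .xml files directly under xml_dir are wanted.
    return sc.wholeTextFiles(xml_dir.rstrip("/") + "/*.xml")

# Notebook usage (sc is predefined on Databricks):
# pairs = read_pattern_pool(sc)
# print(pairs.keys().take(5))
&lt;/LI-CODE&gt;
&lt;P&gt;wholeTextFiles keeps each file together as one record, which usually suits XML better than the line-by-line split that textFile produces.&lt;/P&gt;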
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Sep 2025 15:29:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-fetch-spark-addfiles-when-used-multi-node-cluster/m-p/132574#M49557</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-09-19T15:29:02Z</dc:date>
    </item>
    <item>
      <title>Re: How to fetch spark.addFiles when used multi node cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-fetch-spark-addfiles-when-used-multi-node-cluster/m-p/132576#M49558</link>
      <description>&lt;P&gt;Currently, the job reads these XML files from DBFS for every batch, which makes it a bit slow. Rather than reading them from DBFS each time, I want to read them from executor memory.&lt;/P&gt;</description>
      <pubDate>Fri, 19 Sep 2025 15:33:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-fetch-spark-addfiles-when-used-multi-node-cluster/m-p/132576#M49558</guid>
      <dc:creator>Mahesh_rathi__</dc:creator>
      <dc:date>2025-09-19T15:33:58Z</dc:date>
    </item>
    <item>
      <title>Re: How to fetch spark.addFiles when used multi node cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-fetch-spark-addfiles-when-used-multi-node-cluster/m-p/132578#M49559</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/185621"&gt;@Mahesh_rathi__&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;If you want to read them in executor memory, can you broadcast the paths and then read the files on the executors?&lt;/P&gt;</description>
      <pubDate>Fri, 19 Sep 2025 15:52:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-fetch-spark-addfiles-when-used-multi-node-cluster/m-p/132578#M49559</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-09-19T15:52:17Z</dc:date>
    </item>
    <item>
      <title>Re: How to fetch spark.addFiles when used multi node cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-fetch-spark-addfiles-when-used-multi-node-cluster/m-p/132579#M49560</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/185621"&gt;@Mahesh_rathi__&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;Sample code which might help you:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from pyspark import SparkFiles

xml_dir = "dbfs:/FileStore/Mrathi/pattern-pool"

files = [f for f in dbutils.fs.ls(xml_dir) if f.name.endswith(".xml")]
for file in files:
    sc.addFile(file.path)

file_names = [f.name for f in files]
files_bc = sc.broadcast(file_names)

def read_local_files(_):
    # Runs on the executors: SparkFiles.get resolves each executor's own local copy.
    from pyspark import SparkFiles
    for name in files_bc.value:
        local_path = SparkFiles.get(name)
        print(local_path)  # appears in the executor logs, not in the notebook output
        with open(local_path, "r") as fh:
            for line in fh:
                yield line

# One seed element, so the files are read once, on whichever executor runs that task.
rdd = sc.parallelize([0], sc.defaultParallelism).flatMap(read_local_files)
print(rdd.take(5))
&lt;/LI-CODE&gt;
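&lt;P&gt;If the goal is to keep the files in executor memory, another option is to broadcast the file contents rather than the file names. This is only a sketch under some assumptions: the roughly 12 XML files are small enough to read on the driver, the /dbfs FUSE mount is available on the driver, and the files are UTF-8. load_patterns and broadcast_patterns are hypothetical helper names:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import os

def load_patterns(local_dir):
    """Read every .xml file in local_dir into a {file_name: text} dict (driver side)."""
    patterns = {}
    for name in sorted(os.listdir(local_dir)):
        if name.endswith(".xml"):
            # Assumption: the files are UTF-8 encoded.
            with open(os.path.join(local_dir, name), "r", encoding="utf-8") as fh:
                patterns[name] = fh.read()
    return patterns

def broadcast_patterns(sc, local_dir="/dbfs/FileStore/Mrathi/pattern-pool"):
    # Assumption: the /dbfs FUSE mount exposes the DBFS folder on the driver.
    return sc.broadcast(load_patterns(local_dir))

# Notebook usage (sc is predefined on Databricks):
# patterns_bc = broadcast_patterns(sc)
# counts = sc.parallelize(range(4)).map(lambda _: len(patterns_bc.value)).collect()
&lt;/LI-CODE&gt;
&lt;P&gt;With this approach the files are read once on the driver, and each executor looks the content up in patterns_bc.value instead of going back to DBFS for every batch.&lt;/P&gt;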
&lt;P&gt;Let me know if it works.&lt;/P&gt;</description>
      <pubDate>Fri, 19 Sep 2025 16:03:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-fetch-spark-addfiles-when-used-multi-node-cluster/m-p/132579#M49560</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-09-19T16:03:33Z</dc:date>
    </item>
  </channel>
</rss>

