I wanted to ship about 12 XML files from a DBFS location to the executors' local paths using sc.addFile. I followed your blog and tweaked my code to build the path with file:///. The result: it worked when the cluster had only one node, but it threw an error on a multi-node cluster, even though I am using SparkFiles.get to fetch the path.
Code:

from pyspark import SparkFiles

# Distribute the pattern-pool directory from DBFS to every node in the cluster
patten_pool_path = "dbfs:/FileStore/Mrathi/pattern-pool"
sc.addFile(patten_pool_path, recursive=True)

# Resolve the local path where the directory landed (this runs on the driver)
patten_pool_path = SparkFiles.get("pattern-pool")
full_path = "file://" + patten_pool_path + "/"
print(full_path)  # output: file:///local_disk0/spark-7f135ab4-231f-4649-8571-f375f8ac738f/userFiles-de1552f4-8e94-4625-a2ca-e2…

# Read the XML files from that local path and show the first five lines of each partition
rdd = sc.textFile(full_path)
head_rdd = rdd.pipe("head -n 5")
print(head_rdd.collect())
Output:
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 16.0 failed 4 times, most recent failure: Lost task 5.3 in stage 16.0 (TID 278) (10.139.64.102 executor 0): java.io.FileNotFoundException: File file:/local_disk0/spark-7f135ab4-231f-4649-8571-f375f8ac738f/userFiles-de1552f4-8e94-4625-a2ca-e21ad2467b63/pattern-pool/Imaging_Measurement_Started_SubPattern_V1.xml does not exist
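
To check whether the resolved path actually differs across nodes, I could run something like the diagnostic below (a rough sketch of my own, assuming SparkFiles.get can also be called inside a task; the partition count of 8 is an arbitrary value just to spread tasks across executors):

from pyspark import SparkFiles

# Path as resolved on the driver
driver_path = SparkFiles.get("pattern-pool")

# Path as resolved inside tasks, i.e. on the executors themselves
executor_paths = (
    sc.parallelize(range(8), 8)
      .map(lambda _: SparkFiles.get("pattern-pool"))
      .distinct()
      .collect()
)

print(driver_path)
print(executor_paths)  # on a multi-node cluster I would expect node-specific spark-*/userFiles-* paths here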