12-24-2016 01:01 AM
We are using the Spark CSV reader to read a CSV file into a DataFrame, and we are running the job in yarn-client mode; it works fine in local mode. We submit the Spark job from an edge node. But when we place the file on a local file path instead of HDFS, we get a FileNotFoundException.
Code:
sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true").option("inferSchema", "true")
.load("file:/filepath/file.csv")
We also tried file:///, but we still get the same error.
Error log:
2016-12-24 16:05:40,044 WARN [task-result-getter-0] scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hklvadcnc06.hk.standardchartered.com): java.io.FileNotFoundException: File file:/shared/sample1.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:241)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
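For what it's worth, a file: URI is always resolved against the local filesystem of whichever process opens it. In yarn-client mode the actual read happens on the executor nodes, not on the edge node where the job was submitted, so file:/shared/sample1.csv must exist at that exact path on every worker, or the file should be placed on HDFS instead. A small pure-Python sketch (illustrative only, not Spark's actual resolution code) of why file:/, file://host/ and file:/// all reduce to the same local path:

```python
from urllib.parse import unquote, urlparse

def local_path_from_uri(uri: str) -> str:
    """Resolve a file: URI to the path each machine would open locally."""
    parsed = urlparse(uri)
    return unquote(parsed.path)

# All three spellings point at the same path on whichever host opens them:
for uri in ("file:/shared/sample1.csv",
            "file://localhost/shared/sample1.csv",
            "file:///shared/sample1.csv"):
    print(local_path_from_uri(uri))
```

So switching between file:/ and file:/// cannot help by itself; the file has to be reachable at that path from every executor, which is why copying it to HDFS is the usual fix.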
06-21-2017 01:01 AM
The path should be file:/// and it works for me. @snsancar, not sure if this got resolved for you or not. If not, let me know so that I can share my code.
01-22-2019 02:30 AM
Hi, please share your code; I am facing the same issue mentioned above.
10-28-2019 12:03 AM
It works if I run the code in a notebook, but it doesn't work if I use a Spark Submit or Python Submit job.
10-24-2019 11:38 AM
I tried all the possible ways to read the files and I can't. It works from a notebook, but I need to run it as a Spark Submit job, and that way it does not work:
pdf = pd.read_csv("/databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip/Bounces.csv")
pdf2 = pd.read_csv("file:/databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip/Bounces.csv")
df3 = pd.read_csv("file:///databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip/Bounces.csv")
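Note that pandas' read_csv runs entirely on the driver, so a plain local path works whenever the file actually exists there; no file: scheme is needed, although file:// URIs are also accepted. Separately, the paths above point *inside* a .csv.zip archive, and a filesystem cannot open a path inside a zip, so the archive would have to be extracted first. A minimal sketch using a hypothetical temporary file in place of Bounces.csv:

```python
import os
import tempfile

import pandas as pd

# Write a small CSV to a temporary local file (stand-in for Bounces.csv).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a,b\n1,2\n3,4\n")
    path = f.name

pdf = pd.read_csv(path)               # plain local path works
pdf2 = pd.read_csv("file://" + path)  # file:// URI also works in pandas
print(pdf.equals(pdf2))               # both reads yield the same frame
os.unlink(path)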
03-08-2020 08:35 AM
Does the file exist on the executor nodes?
12-17-2021 04:47 AM
I am also unable to read a CSV file from a C:\ drive location. Can anyone help? I get an error saying the path doesn't exist.
Code snippet -
path = 'file:///C:/Users/folder_1/folder_2/folder_3/xyz.csv'
df = spark.read.csv(path)
I tried lots of combinations for the above path, but with no success.
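Two separate things may be going on here. If the cluster is a remote Databricks cluster, a C:\ path on your own machine is never visible to it, so no URI spelling can work; the file has to be uploaded to storage the cluster can see first. And when Spark genuinely runs on the Windows machine, the URI form it expects uses forward slashes with three slashes before the drive letter, which pathlib can produce (a sketch reusing the hypothetical path from above):

```python
from pathlib import PureWindowsPath

# Convert a Windows path to file-URI form (forward slashes,
# three slashes before the drive letter).
win = PureWindowsPath(r"C:\Users\folder_1\folder_2\folder_3\xyz.csv")
print(win.as_uri())  # file:///C:/Users/folder_1/folder_2/folder_3/xyz.csv
```

That URI could then be passed to spark.read.csv, assuming the file is readable from both the driver and the executors.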
12-20-2021 12:22 PM
@Abhishek Pathak - My name is Piper, and I'm one of the moderators for Databricks. Thank you for posting your question! Let's see what the community has to say; otherwise, we'll circle back around to this.
12-26-2021 10:31 PM
Hi, thanks for replying. Do we have any update on this? As far as I can tell, it seems we can't read a local file directly. Is that the case?
Also, can I connect to ADLS Gen2 storage (Azure) while using the Community Edition of Databricks? I am getting an error there as well.
Thank You.
01-07-2022 09:23 PM
Hi @Sankaraiah Narayanasamy ,
This seems to be a bug in spark-shell when reading a local file, but there is a workaround: when running spark-submit, just add the following to the command:
--conf "spark.authenticate=false"
See SPARK-23476 for reference.