12-24-2016 01:01 AM
We are using the Spark CSV reader to read a CSV file into a DataFrame, and we are running the job in yarn-client mode; it works fine in local mode. We submit the Spark job from an edge node. But when we place the file on a local file path instead of HDFS, we get a FileNotFoundException.
Code:
sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true").option("inferSchema", "true")
.load("file:/filepath/file.csv")
We also tried file:///, but we still get the same error.
Error log:
2016-12-24 16:05:40,044 WARN [task-result-getter-0] scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hklvadcnc06.hk.standardchartered.com): java.io.FileNotFoundException: File file:/shared/sample1.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:241)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
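For what it's worth, a file: URI is always resolved against the local filesystem of whichever process opens it. In yarn-client mode the actual read happens on the executor nodes, not on the edge node where the job was submitted, so file:/shared/sample1.csv must exist at that exact path on every worker, or the file should be placed on HDFS instead. A small pure-Python sketch (illustrative only, not Spark's actual resolution code) of why file:/, file://host/ and file:/// all reduce to the same local path:

```python
from urllib.parse import unquote, urlparse

def local_path_from_uri(uri: str) -> str:
    """Resolve a file: URI to the path each machine would open locally."""
    parsed = urlparse(uri)
    return unquote(parsed.path)

# All three spellings point at the same path on whichever host opens them:
for uri in ("file:/shared/sample1.csv",
            "file://localhost/shared/sample1.csv",
            "file:///shared/sample1.csv"):
    print(local_path_from_uri(uri))
```

So switching between file:/ and file:/// cannot help by itself; the file has to be reachable at that path from every executor, which is why copying it to HDFS is the usual fix.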
06-21-2017 01:01 AM
The path should be file:/// and it works for me. @snsancar, not sure if this got resolved for you or not. If not, let me know so that I can share my code.
01-22-2019 02:30 AM
Hi, please share your code; I am facing the same issue mentioned above.
10-28-2019 12:03 AM
It works if I run the code in a notebook, but it doesn't work if I use a Spark Submit or Python Submit job.
10-24-2019 11:38 AM
I tried all the possible ways to read the files and I can't. It works from a notebook, but I need to run it as a Spark Submit job, and that way it does not work:
pdf = pd.read_csv("/databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip/Bounces.csv")
pdf2 = pd.read_csv("file:/databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip/Bounces.csv")
df3 = pd.read_csv("file:///databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip/Bounces.csv")
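Note that pandas' read_csv runs entirely on the driver, so a plain local path works whenever the file actually exists there; no file: scheme is needed, although file:// URIs are also accepted. Separately, the paths above point *inside* a .csv.zip archive, and a filesystem cannot open a path inside a zip, so the archive would have to be extracted first. A minimal sketch using a hypothetical temporary file in place of Bounces.csv:

```python
import os
import tempfile

import pandas as pd

# Write a small CSV to a temporary local file (stand-in for Bounces.csv).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a,b\n1,2\n3,4\n")
    path = f.name

pdf = pd.read_csv(path)               # plain local path works
pdf2 = pd.read_csv("file://" + path)  # file:// URI also works in pandas
print(pdf.equals(pdf2))               # both reads yield the same frame
os.unlink(path)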
03-08-2020 08:35 AM
Does the file exist on the executor nodes?
12-17-2021 04:47 AM
I am also unable to read a CSV file from a C:\ drive location. Can anyone help? I get an error saying the path doesn't exist.
Code snippet -
path = 'file:///C:/Users/folder_1/folder_2/folder_3/xyz.csv'
df = spark.read.csv(path)
I tried lots of combinations for the above path, but with no success.
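Two separate things may be going on here. If the cluster is a remote Databricks cluster, a C:\ path on your own machine is never visible to it, so no URI spelling can work; the file has to be uploaded to storage the cluster can see first. And when Spark genuinely runs on the Windows machine, the URI form it expects uses forward slashes with three slashes before the drive letter, which pathlib can produce (a sketch reusing the hypothetical path from above):

```python
from pathlib import PureWindowsPath

# Convert a Windows path to file-URI form (forward slashes,
# three slashes before the drive letter).
win = PureWindowsPath(r"C:\Users\folder_1\folder_2\folder_3\xyz.csv")
print(win.as_uri())  # file:///C:/Users/folder_1/folder_2/folder_3/xyz.csv
```

That URI could then be passed to spark.read.csv, assuming the file is readable from both the driver and the executors.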
12-20-2021 12:22 PM
@Abhishek Pathak - My name is Piper, and I'm one of the moderators for Databricks. Thank you for posting your question! Let's see what the community has to say; otherwise, we'll circle back around to this.
12-26-2021 10:31 PM
Hi, thanks for replying. Do we have any update on this? As far as I can tell, it seems we can't read a local file directly. Is that the case?
Also, can I connect to ADLS Gen2 storage (Azure) while using the Community Edition of Databricks? I am getting an error there as well.
Thank You.
01-07-2022 09:23 PM
Hi @Sankaraiah Narayanasamy ,
This seems to be a bug in spark-shell when reading a local file, but there is a workaround: when running spark-submit, just add the following to the command:
--conf "spark.authenticate=false"
See SPARK-23476 for reference.