Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Not able to read text file from local file path - Spark CSV reader

SankaraiahNaray
New Contributor II

We are using the Spark CSV reader to read a CSV file into a DataFrame. We run the job in yarn-client mode, submitting from an edge node, and it works fine in local mode.

But when we place the file on the local file path instead of HDFS, we get a FileNotFoundException.

Code:

sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load("file:/filepath/file.csv")

We also tried the file:/// prefix, but we still get the same error.
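(A note, not from the thread: the prefix itself is usually not the problem. In yarn-client mode the tasks run on cluster worker nodes, and each executor opens a file: path against its own local filesystem, so the file must exist at the same path on every worker node, or live in HDFS instead. If you just want a well-formed URI, Python's pathlib can build one; a minimal sketch using the path from the stack trace below:)

```python
from pathlib import Path

def to_file_uri(path: str) -> str:
    """Build a well-formed file:// URI from a local path."""
    return Path(path).absolute().as_uri()

# Path taken from the FileNotFoundException in the error log:
print(to_file_uri("/shared/sample1.csv"))  # file:///shared/sample1.csv
```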

Error log:

2016-12-24 16:05:40,044 WARN  [task-result-getter-0] scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hklvadcnc06.hk.standardchartered.com): java.io.FileNotFoundException: File file:/shared/sample1.csv does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$anon$1.<init>(HadoopRDD.scala:241)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


VenkatKrishnan
New Contributor II

The path should use the file:/// prefix, and it works for me. @snsancar

Not sure if this got resolved for you or not. If not, let me know so that I can share my code.

Hi, please share your code to help me resolve the above issue, as I am facing the same one.

It works if I run the code from a notebook, but with a Spark Submit or Python Submit job it doesn't.


EricBellet
New Contributor III

I tried every way I could think of to read the files, and I can't. From a notebook it works, but I need to run a Spark Submit job, and that way it does not work.

# Three path variants tried; none works under spark-submit:
pdf = pd.read_csv("/databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip/Bounces.csv")
pdf2 = pd.read_csv("file:/databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip/Bounces.csv")
df3 = pd.read_csv("file:///databricks/driver/zipFiles/s3Sensor/2017/Tracking_Bounces_20190906.csv.zip/Bounces.csv")
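(An aside, not from the thread: pandas runs entirely on the driver and reads driver-local files with a plain OS path; no file: scheme is needed. A minimal sketch, with a hypothetical temp file standing in for Bounces.csv:)

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a small stand-in CSV on the driver's local disk (hypothetical data).
tmp = Path(tempfile.mkdtemp()) / "Bounces.csv"
tmp.write_text("id,email\n1,a@example.com\n")

# pandas opens the local file directly; a plain path is enough.
pdf = pd.read_csv(tmp)
print(len(pdf))  # 1
```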

ajit1
New Contributor II

Does the file exist on the executor node?

abhi_1825
New Contributor III

I am also not able to read a CSV file from a C:\ drive location. Can anyone help? I get an error saying the path doesn't exist.

Code snippet -

path = 'file:///C:/Users/folder_1/folder_2/folder_3/xyz.csv'

df = spark.read.csv(path)

I tried lots of combinations for the above path, but no success.
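(A note, not from the thread: a Databricks cluster cannot see your laptop's C:\ drive; the path in spark.read.csv is resolved on the cluster, so the file has to be uploaded there first. If you are instead running Spark locally on Windows, pathlib can at least confirm the URI spelling; a sketch using the path pattern from the snippet, folder names as placeholders:)

```python
from pathlib import PureWindowsPath

# Build a well-formed file URI for a Windows path (names are placeholders).
uri = PureWindowsPath("C:/Users/folder_1/folder_2/folder_3/xyz.csv").as_uri()
print(uri)  # file:///C:/Users/folder_1/folder_2/folder_3/xyz.csv
```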

Anonymous
Not applicable

@Abhishek Pathak​ - My name is Piper, and I'm one of the moderators for Databricks. Thank you for posting your question! Let's see what the community has to say; otherwise, we'll circle back around to this.

abhi_1825
New Contributor III

Hi, thanks for replying. Do we have any update on this? As far as I can tell, we can't read a local file directly. Is that the case?

Relatedly, can I connect to ADLS Gen2 storage (Azure) while using the Community Edition of Databricks? I am getting an error there as well.

Thank You.

AshleeBall
New Contributor II

Thanks for your help. It helped me a lot.
