07-12-2019 03:07 PM
Reading data from a URL using Spark on Community Edition, got a path-related error. Any suggestions please?
url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema=True)
df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema=True)
error:
Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv
07-16-2019 01:21 AM
Hi @rr_5454,
You will find the answer here: https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html
This is one of the possible solutions.
08-08-2019 05:15 PM
I'm facing the same issue. Could you provide some code for assistance? Thanks!
09-28-2021 12:31 PM
With code for anyone facing the same issue, and without moving to a different path
import requests

CHUNK_SIZE = 4096

with requests.get("https://raw.githubusercontent.com/suy1968/Adult.csv-Dataset/main/adult.csv", stream=True) as resp:
    if resp.ok:
        with open("/dbfs/FileStore/data/adult.csv", "wb") as f:
            for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                f.write(chunk)

display(spark.read.csv("dbfs:/FileStore/data/adult.csv", header=True, inferSchema=True))
I had to use a different URL, as the one in the original question was no longer available.
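One caveat with the snippet above (an assumption based on how Python's open() behaves, not anything Databricks-specific): open("/dbfs/FileStore/data/adult.csv", "wb") raises "No such file or directory" if the data folder has not been created yet. A hedged variant that creates the parent directory first, using only the standard library (ensure_parent and download are hypothetical helper names):

```python
import os
import urllib.request

def ensure_parent(dest):
    # open(dest, "wb") fails if the parent directory is missing,
    # so create it (and any intermediate folders) up front.
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    return dest

def download(url, dest, chunk_size=4096):
    # Stream the response to disk in chunks, as in the requests version above.
    with urllib.request.urlopen(url) as resp, open(ensure_parent(dest), "wb") as f:
        while chunk := resp.read(chunk_size):
            f.write(chunk)

# download("https://raw.githubusercontent.com/suy1968/Adult.csv-Dataset/main/adult.csv",
#          "/dbfs/FileStore/data/adult.csv")
```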
08-09-2023 03:11 AM
Hi there,
I have pretty much the exact code you have here, and yet it still doesn't work, saying "No such file or directory".
Is this a limitation of the community edition?
import requests

CHUNK_SIZE = 4096

def get_remote_file(dataSrcUrl, destFile):
    '''Simple old-skool Python function to load a remote URL into local hdfs.'''
    destFile = "/dbfs" + destFile
    with requests.get(dataSrcUrl, stream=True) as resp:
        if resp.ok:
            with open(destFile, "wb") as f:
                for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                    f.write(chunk)

get_remote_file("https://gitlab.com/opstar/share20/-/raw/master/university.json", "/Filestore/data/lgdt/university.json")
The directory "dbfs:/Filestore/data/lgdt" definitely exists, as I can see it when running the dbutils.fs.ls(path) command.
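A guess worth testing here (hedged; based on reports that the /dbfs FUSE mount is restricted on Community Edition): writing through open("/dbfs/...") can fail with "No such file or directory" even when dbutils.fs.ls() shows the directory. A workaround sketch is to download to the driver's local /tmp and then copy into DBFS with dbutils (local_tmp_path and fetch_to_dbfs are hypothetical helper names):

```python
import urllib.request

def local_tmp_path(dbfs_dest):
    # Stage the download under /tmp on the driver, keeping the file name.
    return "/tmp/" + dbfs_dest.rsplit("/", 1)[-1]

def fetch_to_dbfs(url, dbfs_dest):
    tmp = local_tmp_path(dbfs_dest)
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as f:
        f.write(resp.read())
    # dbutils is provided by the Databricks notebook runtime; copy the staged
    # file from the driver's local disk into DBFS.
    dbutils.fs.cp("file:" + tmp, dbfs_dest)

# fetch_to_dbfs("https://gitlab.com/opstar/share20/-/raw/master/university.json",
#               "dbfs:/Filestore/data/lgdt/university.json")
```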
10-29-2021 04:00 AM
Hi,
I face the same issue as above, with the following error:
Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv
Unfortunately, this link is dead: https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html
Would it be possible to post the solution again?
Thanks
10-29-2021 01:44 PM
@Bertrand BURCKER - Try here - https://web.archive.org/web/20201030194155/https://forums.databricks.com/questions/10648/upload-loca...
11-02-2021 12:55 AM
11-02-2021 08:44 AM
@Bertrand BURCKER - That's great! Would you be happy to mark your answer as best so that others can find it easily?
Thanks!
11-26-2021 06:14 AM
Hi ,
We can also read the CSV directly, without writing it to DBFS.
Scala Spark approach:
import org.apache.commons.io.IOUtils // already on the Spark cluster's classpath, no need to worry
import java.net.URL
import spark.implicits._ // needed for .toDS() (pre-imported in Databricks notebooks)

val urlfile = new URL("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
val testDummyCSV = IOUtils.toString(urlfile, "UTF-8").lines.toList.toDS()
val testcsv = spark
  .read.option("header", true)
  .option("inferSchema", true)
  .csv(testDummyCSV)
display(testcsv)
Notebook attached
12-13-2021 09:48 AM
Hello everyone, this issue has not been resolved to this day. I appreciate all the workarounds. But shouldn't SparkFiles be able to pull data from an API? I tested SparkFiles on Databricks Community Edition without errors, but on Azure it generates the path-not-found message.
12-14-2021 12:04 AM
Hi,
Does the best answer on this post help you?
12-14-2021 03:56 AM
Hi, I already know the concept of SparkFiles; it's the functionality within Azure that isn't working correctly.
The discussion is here:
03-01-2023 01:10 PM
Sorry, bringing this back up...
from pyspark import SparkFiles
url = "http://raw.githubusercontent.com/ltregan/ds-data/main/authors.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv("file://" + SparkFiles.get("authors.csv"), header=True, inferSchema=True)
df.show()
I get this empty output:
++
||
++
++
Any idea ? Spark 3.2.2 on Mac M1
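One possible explanation (just a hypothesis): that URL is http://, and raw.githubusercontent.com redirects to https://. If addFile() stores an empty body because the redirect isn't followed, Spark infers a zero-column schema, which would match the empty output above. A quick standard-library check of what the URL actually returns:

```python
import urllib.request

url = "http://raw.githubusercontent.com/ltregan/ds-data/main/authors.csv"
with urllib.request.urlopen(url) as resp:
    body = resp.read().decode("utf-8")
    # urllib follows the redirect; resp.geturl() shows the final URL served.
    print(resp.geturl())
    print(len(body.splitlines()), "lines fetched")
```

If data shows up here but the DataFrame is still empty, inspecting os.path.getsize(SparkFiles.get("authors.csv")) would confirm whether addFile() wrote a zero-byte file.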