
reading data from url using spark

AryaMa
New Contributor III

Reading data from a URL using Spark on Community Edition, I got a path-related error. Any suggestions please?

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True) 
df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)

error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv


DonatienTessier
New Contributor III

Hi @rr_5454,

You will find the answer here https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html

You will have to:

  1. get the file into local file storage
  2. move the file to DBFS
  3. load the file into a DataFrame

This is one of the possible solutions; a sketch of the steps follows.
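A minimal PySpark sketch of those three steps, assuming a Databricks cluster where dbutils is available (the target path dbfs:/FileStore/data/adult.csv is only an example, and the URL is the one from the question, which may no longer be live):

import requests

# 1. get the file onto the driver's local file storage
url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
local_path = "/tmp/adult.csv"
with open(local_path, "wb") as f:
    f.write(requests.get(url).content)

# 2. move the file to DBFS so the executors can read it
dbutils.fs.cp("file:" + local_path, "dbfs:/FileStore/data/adult.csv")

# 3. load the file into a DataFrame
df = spark.read.csv("dbfs:/FileStore/data/adult.csv", header=True, inferSchema=True)
display(df)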

THIAM_HUATTAN
Valued Contributor

I face the same issue. Could you provide some code to help? Thanks.

dazfuller
Contributor III

Here's code for anyone facing the same issue, without having to move the file to a different path:

import requests

CHUNK_SIZE = 4096

# stream the file from the URL and write it to DBFS through the /dbfs FUSE mount
with requests.get("https://raw.githubusercontent.com/suy1968/Adult.csv-Dataset/main/adult.csv", stream=True) as resp:
    if resp.ok:
        with open("/dbfs/FileStore/data/adult.csv", "wb") as f:
            for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                f.write(chunk)

display(spark.read.csv("dbfs:/FileStore/data/adult.csv", header=True, inferSchema=True))

I had to use a different URL, as the one in the original question is no longer available.
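One caveat worth noting (an observation, not part of the original reply): writing through the /dbfs FUSE mount does not create missing parent directories, so if /FileStore/data does not exist yet the open() call fails with "No such file or directory". It can be created up front:

# create the target directory on DBFS before writing (path is just an example)
dbutils.fs.mkdirs("dbfs:/FileStore/data")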

lemfo
New Contributor II

Hi there,
I have pretty much the exact code you have here, and yet it still doesn't work; it says "No such file or directory".
Is this a limitation of the Community Edition?

import requests

CHUNK_SIZE = 4096

def get_remote_file(dataSrcUrl, destFile):
    '''Simple old-school Python function to load a remote URL into DBFS.'''
    # write through the /dbfs FUSE mount
    destFile = "/dbfs" + destFile
    with requests.get(dataSrcUrl, stream=True) as resp:
        if resp.ok:
            with open(destFile, "wb") as f:
                for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                    f.write(chunk)

get_remote_file("https://gitlab.com/opstar/share20/-/raw/master/university.json", "/Filestore/data/lgdt/university.json")

The directory "dbfs:/Filestore/data/lgdt" definitely exists as i can see it when running the dbutils.fs.ls(path) command

RantoB
Valued Contributor

Hi,

I face the same issue as above, with the following error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv

Unfortunately this link is dead: https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html

Would it be possible to give the solution again?

Thanks

RantoB
Valued Contributor

Anonymous
Not applicable

@Bertrand BURCKER - That's great! Would you be happy to mark your answer as best so that others can find it easily?

Thanks!

User16752246494
Contributor

Hi,

We can also read CSV directly without writing it to DBFS.

Scala Spark approach:

import org.apache.commons.io.IOUtils // already available on Databricks clusters, no extra jar needed
import java.net.URL

import spark.implicits._ // for .toDS(); may already be in scope in a notebook

// read the CSV content from the URL on the driver as a Dataset[String]
val urlfile = new URL("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
val testDummyCSV = IOUtils.toString(urlfile, "UTF-8").lines.toList.toDS()

// parse the in-memory CSV lines without writing anything to DBFS
val testcsv = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv(testDummyCSV)

display(testcsv)


Notebook attached
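For readers working in PySpark, a rough equivalent (a sketch, not from the original reply) relies on spark.read.csv also accepting an RDD of strings holding CSV rows:

import requests

# fetch the CSV text on the driver and split it into lines
url = "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"
lines = requests.get(url).text.splitlines()

# spark.read.csv can parse an RDD of CSV rows, so nothing is written to DBFS
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv(spark.sparkContext.parallelize(lines)))

display(df)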

weldermartins
Honored Contributor

Hello everyone, this issue still hasn't been resolved. I appreciate all the workarounds, but shouldn't SparkFiles be able to pull data from an API directly? I tested SparkFiles on Databricks Community Edition without errors, but on Azure it produces the path-not-found message.

RantoB
Valued Contributor

Hi,

Does the best answer of this post help you:

read csv directly from url with pyspark (databricks.com)?

weldermartins
Honored Contributor

Hi, I already know how SparkFiles works conceptually; it is its behavior within Azure that is not correct.

The discussion is here:

https://community.databricks.com/s/question/0D53f00001XD3pjCAD/sparkfiles-strange-behavior-on-azure-...

padang
New Contributor II

Sorry, bringing this back up...

from pyspark import SparkFiles

url = "http://raw.githubusercontent.com/ltregan/ds-data/main/authors.csv"
spark.sparkContext.addFile(url)

# read the downloaded copy from the driver's local filesystem
df = spark.read.csv("file://" + SparkFiles.get("authors.csv"), header=True, inferSchema=True)
df.show()

I get this empty output:

++
||
++
++

Any idea? Spark 3.2.2 on Mac M1.
