
reading data from url using spark

AryaMa
New Contributor III

Reading data from a URL using Spark on Community Edition, I got a path-related error. Any suggestions please?

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True) 
df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)

error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv


DonatienTessier
New Contributor III

Hi @rr_5454,

You will find the answer here https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html

You will have to:

  1. get the file into local file storage
  2. move the file to DBFS
  3. load the file into a DataFrame

This is one of the possible solutions; a sketch of the steps follows.
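A minimal PySpark sketch of those three steps, assuming a Databricks cluster where dbutils is available (the target path dbfs:/FileStore/data/adult.csv is only an example, and the URL is the one from the question, which may no longer be live):

import requests

# 1. get the file onto the driver's local file storage
url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
local_path = "/tmp/adult.csv"
with open(local_path, "wb") as f:
    f.write(requests.get(url).content)

# 2. move the file to DBFS so the executors can read it
dbutils.fs.cp("file:" + local_path, "dbfs:/FileStore/data/adult.csv")

# 3. load the file into a DataFrame
df = spark.read.csv("dbfs:/FileStore/data/adult.csv", header=True, inferSchema=True)
display(df)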

THIAM_HUATTAN
Valued Contributor

I face the same issue. Could you provide some code to help? Thanks.

dazfuller
Contributor III

Here's code for anyone facing the same issue, without having to move the file to a different path:

import requests

CHUNK_SIZE = 4096

# stream the file from the URL and write it to DBFS through the /dbfs FUSE mount
with requests.get("https://raw.githubusercontent.com/suy1968/Adult.csv-Dataset/main/adult.csv", stream=True) as resp:
    if resp.ok:
        with open("/dbfs/FileStore/data/adult.csv", "wb") as f:
            for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                f.write(chunk)

display(spark.read.csv("dbfs:/FileStore/data/adult.csv", header=True, inferSchema=True))

I had to use a different URL, as the one in the original question is no longer available.
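One caveat worth noting (an observation, not part of the original reply): writing through the /dbfs FUSE mount does not create missing parent directories, so if /FileStore/data does not exist yet the open() call fails with "No such file or directory". It can be created up front:

# create the target directory on DBFS before writing (path is just an example)
dbutils.fs.mkdirs("dbfs:/FileStore/data")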

lemfo
New Contributor II

Hi there,
I have pretty much the exact code you have here, and yet it still doesn't work; it says "No such file or directory".
Is this a limitation of the Community Edition?

import requests

CHUNK_SIZE = 4096

def get_remote_file(dataSrcUrl, destFile):
    '''Simple old-school Python function to load a remote URL into DBFS.'''
    # write through the /dbfs FUSE mount
    destFile = "/dbfs" + destFile
    with requests.get(dataSrcUrl, stream=True) as resp:
        if resp.ok:
            with open(destFile, "wb") as f:
                for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                    f.write(chunk)

get_remote_file("https://gitlab.com/opstar/share20/-/raw/master/university.json", "/Filestore/data/lgdt/university.json")

The directory "dbfs:/Filestore/data/lgdt" definitely exists as i can see it when running the dbutils.fs.ls(path) command

RantoB
Valued Contributor

Hi,

I face the same issue as above, with the following error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv

Unfortunately this link is dead: https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html

Would it be possible to give the solution again?

Thanks

RantoB
Valued Contributor

Anonymous
Not applicable

@Bertrand BURCKER - That's great! Would you be happy to mark your answer as best so that others can find it easily?

Thanks!

User16752246494
Contributor

Hi,

We can also read CSV directly without writing it to DBFS.

Scala Spark approach:

import org.apache.commons.io.IOUtils // already available on Databricks clusters, no extra jar needed
import java.net.URL

import spark.implicits._ // for .toDS(); may already be in scope in a notebook

// read the CSV content from the URL on the driver as a Dataset[String]
val urlfile = new URL("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
val testDummyCSV = IOUtils.toString(urlfile, "UTF-8").lines.toList.toDS()

// parse the in-memory CSV lines without writing anything to DBFS
val testcsv = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv(testDummyCSV)

display(testcsv)


Notebook attached
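For readers working in PySpark, a rough equivalent (a sketch, not from the original reply) relies on spark.read.csv also accepting an RDD of strings holding CSV rows:

import requests

# fetch the CSV text on the driver and split it into lines
url = "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"
lines = requests.get(url).text.splitlines()

# spark.read.csv can parse an RDD of CSV rows, so nothing is written to DBFS
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv(spark.sparkContext.parallelize(lines)))

display(df)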

weldermartins
Honored Contributor

Hello everyone, this issue still hasn't been resolved. I appreciate all the workarounds, but shouldn't SparkFiles be able to pull data from an API directly? I tested SparkFiles on Databricks Community Edition without errors, but on Azure it produces the path-not-found message.

RantoB
Valued Contributor

Hi,

Does the best answer of this post help you:

read csv directly from url with pyspark (databricks.com)?

weldermartins
Honored Contributor

Hi, I already know how SparkFiles works conceptually; it is its behavior within Azure that is not correct.

The discussion is here:

https://community.databricks.com/s/question/0D53f00001XD3pjCAD/sparkfiles-strange-behavior-on-azure-...

padang
New Contributor II

Sorry, bringing this back up...

from pyspark import SparkFiles

url = "http://raw.githubusercontent.com/ltregan/ds-data/main/authors.csv"
spark.sparkContext.addFile(url)

# read the downloaded copy from the driver's local filesystem
df = spark.read.csv("file://" + SparkFiles.get("authors.csv"), header=True, inferSchema=True)
df.show()

I get this empty output:

++
||
++
++

Any idea? Spark 3.2.2 on Mac M1.
