Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

reading data from url using spark

AryaMa
New Contributor III

Reading data from a URL using Spark on Community Edition, I got a path-related error. Any suggestions, please?

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True) 
df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)

error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv


DonatienTessier
Contributor

Hi @rr_5454,

You will find the answer here: https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html

You will have to:

  1. get the file onto local file storage
  2. move the file to DBFS
  3. load the file into a DataFrame

This is one possible solution; a minimal sketch of the steps follows.
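A rough sketch of those three steps, assuming a Databricks notebook where spark and dbutils are predefined (the DBFS target path is a placeholder, not from the thread):

import urllib.request

# 1. Download the file to the driver's local disk
local_path = "/tmp/adult.csv"
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv",
    local_path,
)

# 2. Move it into DBFS (file: is the driver-local filesystem scheme)
dbutils.fs.mv("file:" + local_path, "dbfs:/FileStore/data/adult.csv")

# 3. Load it into a DataFrame
df = spark.read.csv("dbfs:/FileStore/data/adult.csv", header=True, inferSchema=True)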

THIAM_HUATTAN
Valued Contributor

I face the same issue. Could you provide some code to help? Thanks!

dazfuller
Contributor III

Here's code for anyone facing the same issue, without moving the file to a different path.

import requests

CHUNK_SIZE = 4096

# Stream the download and write it straight to DBFS via the /dbfs FUSE mount
with requests.get("https://raw.githubusercontent.com/suy1968/Adult.csv-Dataset/main/adult.csv", stream=True) as resp:
    if resp.ok:
        with open("/dbfs/FileStore/data/adult.csv", "wb") as f:
            for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                f.write(chunk)

display(spark.read.csv("dbfs:/FileStore/data/adult.csv", header=True, inferSchema=True))

I had to use a different URL, as the one in the original question is no longer available.
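One caveat to add (an assumption on my part, not something covered in the thread): open() won't create missing directories on the /dbfs mount, so if /FileStore/data doesn't exist yet the write fails with "No such file or directory". Creating it first avoids that:

# Ensure the target directory exists before opening the file for writing
dbutils.fs.mkdirs("dbfs:/FileStore/data")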

lemfo
New Contributor II

Hi there,
I have pretty much the exact code you have here, and yet it still doesn't work; it fails with "No such file or directory".
Is this a limitation of Community Edition?

import requests

CHUNK_SIZE = 4096

def get_remote_file(dataSrcUrl, destFile):
    '''Simple old-skool Python function to load a remote URL into DBFS.'''
    destFile = "/dbfs" + destFile
    with requests.get(dataSrcUrl, stream=True) as resp:
        if resp.ok:
            with open(destFile, "wb") as f:
                for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                    f.write(chunk)

get_remote_file("https://gitlab.com/opstar/share20/-/raw/master/university.json", "/Filestore/data/lgdt/university.json")

The directory "dbfs:/Filestore/data/lgdt" definitely exists as i can see it when running the dbutils.fs.ls(path) command

RantoB
Valued Contributor

Hi,

I face the same issue as above, with the following error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv

Unfortunately, this link is dead: https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html

Would it be possible to share the solution again?

Thanks


Anonymous
Not applicable

@Bertrand BURCKER - That's great! Would you be happy to mark your answer as best so that others can find it easily?

Thanks!

User16752246494
Contributor

Hi,

We can also read a CSV directly from a URL, without writing it to DBFS first.

Scala Spark approach:

import org.apache.commons.io.IOUtils // already on the Spark cluster's classpath
import java.net.URL

val urlfile = new URL("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
val testDummyCSV = IOUtils.toString(urlfile, "UTF-8").lines.toList.toDS()
val testcsv = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv(testDummyCSV)
display(testcsv)
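For anyone who prefers Python, a rough PySpark equivalent of the same no-DBFS-write idea (a sketch; it assumes the requests library is available on the cluster):

import requests

# Fetch the CSV into driver memory, split it into lines, and let Spark parse them
text = requests.get("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv").text
lines = spark.sparkContext.parallelize(text.splitlines())
df = spark.read.option("header", True).option("inferSchema", True).csv(lines)
display(df)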


weldermartins
Honored Contributor

Hello everyone, this issue still hasn't been resolved. I appreciate all the workarounds, but shouldn't SparkFiles be able to pull data from an API directly? I tested SparkFiles on Databricks Community Edition without errors, but on Azure it produces the path-not-found message. The pattern in question is sketched below.
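For reference, a minimal sketch of that pattern, reusing the URL from the original question (my understanding of the failure mode, not confirmed in this thread, is that SparkFiles.get() returns a node-local path, which some platforms resolve against dbfs:/ unless the scheme is made explicit):

from pyspark import SparkFiles

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
spark.sparkContext.addFile(url)
# Runs fine on Community Edition; on Azure the bare path triggers "Path does not exist: dbfs:/..."
df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema=True)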

RantoB
Valued Contributor

Hi,

Does the best answer on this post help you:

read csv directly from url with pyspark (databricks.com)?

weldermartins
Honored Contributor

Hi, I already know the SparkFiles concept; it's the behavior within Azure that is not correct.

The discussion is here:

https://community.databricks.com/s/question/0D53f00001XD3pjCAD/sparkfiles-strange-behavior-on-azure-...

padang
New Contributor II

Sorry, bringing this back up...

from pyspark import SparkFiles

url = "http://raw.githubusercontent.com/ltregan/ds-data/main/authors.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv("file://" + SparkFiles.get("authors.csv"), header=True, inferSchema=True)
df.show()

I get this empty output:

++
||
++
++

Any idea? This is Spark 3.2.2 on a Mac M1.
