topic Re: reading data from url using spark in Data Engineering

reading data from url using spark

AryaMa — Fri, 12 Jul 2019 22:07:30 GMT

I am reading data from a URL using Spark on Community Edition and got a path-related error. Any suggestions, please?

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True) 
df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)

error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv

Re: reading data from url using spark

DonatienTessier — Tue, 16 Jul 2019 08:21:08 GMT

Hi @rr_5454,

You will find the answer here https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html

You will have to:

get the file to local file storage
move the file to DBFS
load the file in a dataframe

This is one of the possible solutions.
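
The three steps above could look roughly like the sketch below. It is not taken from the linked answer, which is not quoted in this thread. The helper names `download_to_local` and `load_csv_via_dbfs`, the local path `/tmp/adult.csv` and the DBFS target path are assumptions made for illustration; `spark` and `dbutils` are the objects a Databricks notebook provides.

```python
import shutil
import urllib.request


def download_to_local(url, local_path):
    """Step 1: stream the remote file to the driver's local disk."""
    with urllib.request.urlopen(url) as resp, open(local_path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return local_path


def load_csv_via_dbfs(spark, dbutils, url,
                      local_path="/tmp/adult.csv",
                      dbfs_path="dbfs:/FileStore/data/adult.csv"):
    """Hypothetical helper: download, copy to DBFS, then read with Spark."""
    download_to_local(url, local_path)
    # Step 2: copy from the driver's local disk ("file:" scheme) into DBFS.
    dbutils.fs.cp("file:" + local_path, dbfs_path)
    # Step 3: read the copy from DBFS, which the whole cluster can see.
    return spark.read.csv(dbfs_path, header=True, inferSchema=True)
```

In a notebook this would be called as `df = load_csv_via_dbfs(spark, dbutils, url)`. Passing `spark` and `dbutils` in as arguments keeps the download step usable outside Databricks.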

Re: reading data from url using spark

THIAM_HUATTAN — Fri, 09 Aug 2019 00:15:45 GMT

I am facing the same issue. Could you provide some code to help? Thanks.

Re: reading data from url using spark

dazfuller — Tue, 28 Sep 2021 19:31:54 GMT

Here is some code for anyone facing the same issue. It writes directly to DBFS, so the file does not have to be moved to a different path afterwards.

import requests
 
CHUNK_SIZE=4096
 
with requests.get("https://raw.githubusercontent.com/suy1968/Adult.csv-Dataset/main/adult.csv", stream=True) as resp:
  if resp.ok:
    with open("/dbfs/FileStore/data/adult.csv", "wb") as f:
      for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
        f.write(chunk)
 
display(spark.read.csv("dbfs:/FileStore/data/adult.csv", header=True, inferSchema=True))

I had to use a different URL as the one in the original question was no longer available

Re: reading data from url using spark

RantoB — Fri, 29 Oct 2021 11:00:31 GMT

Hi,

I am facing the same issue as above, with the following error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv

Unfortunately this link is dead: https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html

Would it be possible to give the solution again ?

Thanks

Re: reading data from url using spark

Piper_Wilson — Fri, 29 Oct 2021 20:44:58 GMT

@Bertrand BURCKER - Try here - https://web.archive.org/web/20201030194155/https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html

Re: reading data from url using spark

RantoB — Tue, 02 Nov 2021 07:55:24 GMT

I got an answer there :

read csv directly from url with pyspark (databricks.com)

thanks

Re: reading data from url using spark

Anonymous — Tue, 02 Nov 2021 15:44:32 GMT

@Bertrand BURCKER - That's great! Would you be happy to mark your answer as best so that others can find it easily?

Thanks!

Re: reading data from url using spark

User16752246494 — Fri, 26 Nov 2021 14:14:20 GMT

Hi ,

We can also read CSV directly without writing it to DBFS.

Scala spark Approach

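import org.apache.commons.io.IOUtils // commons-io already ships with the Spark cluster
import java.net.URL
import spark.implicits._ // needed for toDS()

// Download the whole file into a string, then turn its lines into a Dataset[String].
// linesIterator avoids the clash with java.lang.String.lines on JDK 11+.
val urlfile = new URL("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
val testDummyCSV = IOUtils.toString(urlfile, "UTF-8").linesIterator.toList.toDS()

// Parse the in-memory lines as CSV
val testcsv = spark
  .read.option("header", true)
  .option("inferSchema", true)
  .csv(testDummyCSV)

display(testcsv)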

Notebook attached
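
For Python notebooks, roughly the same idea can be sketched as below: fetch the text on the driver, then hand the lines to Spark as an RDD of strings, which `spark.read.csv` accepts in PySpark. The helper names `fetch_lines` and `read_csv_from_url` are made up for this example. Like the Scala version, it pulls the whole file into driver memory, so it only suits small files.

```python
import urllib.request


def fetch_lines(url, encoding="utf-8"):
    """Download the resource on the driver and return its non-empty lines."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode(encoding)
    return [line for line in text.splitlines() if line.strip()]


def read_csv_from_url(spark, url):
    """Hypothetical helper: parse CSV text held in memory, without touching DBFS."""
    lines = fetch_lines(url)
    # Distribute the lines so Spark's CSV reader can parse them.
    rdd = spark.sparkContext.parallelize(lines)
    return spark.read.csv(rdd, header=True, inferSchema=True)
```

Usage in a notebook would be `df = read_csv_from_url(spark, "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")`. Splitting on line breaks assumes no quoted field in the file spans several lines.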

Re: reading data from url using spark

weldermartins — Mon, 13 Dec 2021 17:48:49 GMT

Hello everyone, this issue is still unresolved today. I appreciate all the workarounds, but shouldn't SparkFiles be able to fetch data from an API? I tested SparkFiles on Databricks Community Edition without errors, but on Azure it produces the "path does not exist" message.

Re: reading data from url using spark

RantoB — Tue, 14 Dec 2021 08:04:48 GMT

hi,

Does the best answer to this post help you?

read csv directly from url with pyspark (databricks.com)

Re: reading data from url using spark

weldermartins — Tue, 14 Dec 2021 11:56:34 GMT

Hi, I already know how SparkFiles is supposed to work. It is its behavior on Azure that is not correct.

The discussion is here:

https://community.databricks.com/s/question/0D53f00001XD3pjCAD/sparkfiles-strange-behavior-on-azure-databricks-runtime-10

Re: reading data from url using spark

padang — Wed, 01 Mar 2023 21:10:07 GMT

Sorry, bringing this back up...

from pyspark import SparkFiles
url = "http://raw.githubusercontent.com/ltregan/ds-data/main/authors.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv("file://"+SparkFiles.get("authors.csv"), header=True, inferSchema= True)
df.show()

I get this empty output:

++
||
++
++

Any idea ? Spark 3.2.2 on Mac M1

Re: reading data from url using spark

lemfo — Wed, 09 Aug 2023 10:11:04 GMT

Hi there,
I have pretty much the exact code you have here, and yet it still doesn't work; it fails with "No such file or directory".
Is this a limitation of the community edition?

import requests

CHUNK_SIZE=4096

def get_remote_file(dataSrcUrl, destFile):
  '''Simple old skool python function to load a remote url into local hdfs
  '''
  destFile = "/dbfs" + destFile
  #
  with requests.get(dataSrcUrl, stream=True) as resp:
    if resp.ok:
      with open(destFile, "wb") as f:
        for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
          f.write(chunk)

get_remote_file("https://gitlab.com/opstar/share20/-/raw/master/university.json", "/Filestore/data/lgdt/university.json" )

The directory "dbfs:/Filestore/data/lgdt" definitely exists, as I can see it when running the dbutils.fs.ls(path) command.
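
One thing worth ruling out is sketched below. Python's `open()` raises "No such file or directory" whenever the parent directory is not visible on the driver's local filesystem, which would be the case if the `/dbfs` mount is not exposed on the cluster even though `dbutils.fs.ls` can see the DBFS directory. That is an assumption here, not something confirmed in this thread. The hypothetical helper `save_stream` writes to a plain local path instead and creates the directory first.

```python
import os


def save_stream(chunks, dest_file):
    """Write an iterable of byte chunks to dest_file, creating parent directories."""
    parent = os.path.dirname(dest_file)
    if parent:
        os.makedirs(parent, exist_ok=True)
    written = 0
    with open(dest_file, "wb") as f:
        for chunk in chunks:
            written += f.write(chunk)
    return written  # number of bytes written
```

With `requests`, this could be fed with `resp.iter_content(chunk_size=CHUNK_SIZE)` after calling `resp.raise_for_status()`, writing to a local path such as `/tmp/lgdt/university.json` (an illustrative path). The file could then be copied into DBFS with `dbutils.fs.cp("file:/tmp/lgdt/university.json", "dbfs:/Filestore/data/lgdt/university.json")`.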