10-29-2021 04:08 AM
I would like to load a CSV file from a URL directly into a Spark DataFrame in Databricks. I tried the following code:
url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_for_header=true&csv_separator=%3B"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("eco2mix-national-tr.csv"), header=True, inferSchema= True)
I got the following error:
Path does not exist: dbfs:/local_disk0/spark-c03e8325-0ab6-4c2e-bffb-c9d290283b31/userFiles-a507dd96-cc63-4e47-9b0f-44d2a940bb10/eco2mix-national-tr.csv
Thanks
10-29-2021 07:46 AM
OK, so I tested it myself, and I think I found the issue:
addFile() does not save the file as 'eco2mix-national-tr.csv' but as 'download'.
You can check this with the %sh magic command:
%sh
ls "/local_disk0/spark-.../userFiles-.../"
The listing shows no eco2mix file, only a file named 'download'.
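The 'download' name is consistent with the file being named after the last segment of the URL path rather than after the dataset. Here is a minimal sketch of that rule in plain Python, assuming this is how the name is derived. The helper `local_name_from_url` is hypothetical and only mimics the behaviour observed in this thread:

```python
import posixpath
from urllib.parse import urlparse


def local_name_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring the query string."""
    path = urlparse(url).path  # e.g. "/explore/dataset/eco2mix-national-tr/download/"
    return posixpath.basename(path.rstrip("/"))


url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv"
print(local_name_from_url(url))  # download
```

Running a check like this before calling SparkFiles.get() tells you which name to ask for.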
To see the contents of the download file, use cat:
%sh
cat "/local_disk0/spark-.../userFiles-.../download"
This prints the contents of the file.
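If you prefer to stay in a Python cell, the same ls and cat checks can be sketched with the standard library. The helpers `list_files` and `head` are hypothetical names. The demo runs against a temporary directory with made-up content, because the real directory name differs on every cluster:

```python
import os
import tempfile
from itertools import islice
from typing import List


def list_files(directory: str) -> List[str]:
    """Rough equivalent of `ls`: sorted names in a directory."""
    return sorted(os.listdir(directory))


def head(path: str, n: int = 5) -> List[str]:
    """Rough equivalent of `cat | head`: the first n lines of a text file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Demo on a temporary directory with made-up content.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "download"), "w", encoding="utf-8") as f:
    f.write("a;b\n1;2\n3;4\n")

listing = list_files(demo_dir)
first_lines = head(os.path.join(demo_dir, "download"), n=2)
print(listing)
print(first_lines)

# In a notebook (assumption: SparkFiles is usable on your cluster):
# from pyspark import SparkFiles
# root = SparkFiles.getRootDirectory()
# print(list_files(root))
```

Looking at the first lines also reveals the separator the file actually uses.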
Next, read the file with spark.read.csv and the file:// prefix.
Putting it together:
from pyspark import SparkFiles

url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv"
sc.addFile(url)
# addFile() stored the file under the name 'download'
path = SparkFiles.get("download")
# The file:// prefix makes Spark read from the local disk
df = spark.read.csv("file://" + path, header=True, inferSchema=True, sep=";")
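The first error message in this thread shows the scheme-less path being resolved as dbfs:/local_disk0/..., which is why the explicit file:// prefix matters. Here is a small sketch of building that URI with a guard against relative paths. The helper `to_file_uri` is a hypothetical name, and the example path is made up:

```python
def to_file_uri(local_path: str) -> str:
    """Prefix an absolute local path with the file:// scheme."""
    if not local_path.startswith("/"):
        raise ValueError(f"expected an absolute path, got {local_path!r}")
    return "file://" + local_path


print(to_file_uri("/local_disk0/tmp/download"))  # file:///local_disk0/tmp/download
```

Because the path itself starts with a slash, the result has three slashes after "file:", the usual form for a local file URI.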
When working with local files, it is always a good idea to list the directory in question and cat the file to see what is actually there.
10-29-2021 04:27 AM
Check this:
https://stackoverflow.com/questions/57014043/reading-data-from-url-using-spark-databricks-platform
Basically, add "file://" to the front of your path.
10-29-2021 04:45 AM
I've already read this post and tried it, but it did not work either:
Path does not exist: file:/local_disk0/spark-48fd5772-d1a9-40f2-a2f2-bcad38962ed6/userFiles-0298f7e7-105c-4c8d-a845-0975edd378a0/eco2mix-national-tr.csv
10-29-2021 09:22 AM
Great, this is working. Thank you.
10-29-2021 03:01 PM
@Bertrand BURCKER - If @Werner Stinckens answered your question, would you mark his as the best answer? That will help others find the solution quickly.
11-26-2021 06:16 AM
Hi,
You can also use the following (Scala):
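import org.apache.commons.io.IOUtils // already on the Spark cluster classpath, no extra jar needed
import java.net.URL

val urlfile = new URL("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
// Read the whole file on the driver and turn its lines into a Dataset[String].
// linesIterator avoids the clash with java.lang.String.lines on JDK 11+.
// toDS() needs `import spark.implicits._` if your environment does not already import it.
val testDummyCSV = IOUtils.toString(urlfile, "UTF-8").linesIterator.toList.toDS()
val testcsv = spark
  .read.option("header", true)
  .option("inferSchema", true)
  .csv(testDummyCSV)
display(testcsv)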
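For Python notebooks, a rough analogue of the same idea (get the text on the driver, then parse it) can be sketched with the standard library. This assumes the file is small enough to fit in driver memory. The helper `parse_csv_text` is a hypothetical name, the sample text is made up, and the commented Spark call was not run here:

```python
import csv
import io
from typing import List, Tuple


def parse_csv_text(text: str, sep: str = ",") -> Tuple[List[str], List[List[str]]]:
    """Split CSV text into a header row and data rows (all values stay strings)."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        raise ValueError("no rows found in CSV text")
    return rows[0], rows[1:]


# Made-up sample; in practice `text` would be the body downloaded from the URL.
sample = "name;value\na;1\nb;2\n"
header, rows = parse_csv_text(sample, sep=";")
print(header)
print(rows)

# In a notebook (assumption: `spark` is the active SparkSession):
# df = spark.createDataFrame(rows, schema=header)
```

Unlike inferSchema, this keeps every column as a string, so cast the columns afterwards if you need numeric types.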
10-12-2023 06:49 PM
I know it's a two-year-old thread, but I needed to find a solution to this very thing today. I had one notebook using SparkContext
3 weeks ago
Hello, it's the end of 2024 and I still have this issue with Python. As mentioned, the sc method no longer works. Also, working with volumes within "/databricks/driver/" is not supported in Apache Spark.
ALTERNATIVE SOLUTION: Use requests to download the file from the URL and save it to a DBFS path under "/FileStore/", which is accessible from Databricks.
import requests

url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_for_header=true&csv_separator=%3B"
local_path = "/FileStore/eco2mix-national-tr.csv"

# Use requests to download the file; fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()

# Write through the /dbfs mount so the file lands in DBFS
with open("/dbfs" + local_path, "wb") as f:
    f.write(response.content)

# Read the CSV with specific options; the URL requests ";" as the separator
df = spark.read.csv(
    path=local_path,
    header=True,
    inferSchema=True,
    sep=";",
)
df.show()
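The snippet above uses two spellings of the same location: "/dbfs" + local_path for Python's open(), and local_path alone for Spark. Here is a small sketch that makes the mapping explicit, assuming the /dbfs mount is available on your cluster. The helper `dbfs_views` is a hypothetical name:

```python
from typing import Dict


def dbfs_views(dbfs_path: str) -> Dict[str, str]:
    """Return the local-mount and Spark spellings of one DBFS path."""
    if not dbfs_path.startswith("/"):
        raise ValueError(f"expected an absolute DBFS path, got {dbfs_path!r}")
    return {
        "local": "/dbfs" + dbfs_path,  # for open(), os, shutil on the driver
        "spark": "dbfs:" + dbfs_path,  # for spark.read / spark.write
    }


views = dbfs_views("/FileStore/eco2mix-national-tr.csv")
print(views["local"])  # /dbfs/FileStore/eco2mix-national-tr.csv
print(views["spark"])  # dbfs:/FileStore/eco2mix-national-tr.csv
```

Mixing the two up produces the same kind of "Path does not exist" error as in the original question, so it helps to keep them as separate variables.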