<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: read csv directly from url with pyspark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/49073#M28465</link>
    <description>&lt;P&gt;I know this is a two-year-old thread, but I needed to find a solution to this very thing today. I had one notebook using &lt;STRONG&gt;SparkContext&lt;/STRONG&gt;:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;from pyspark import SparkFiles&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;from pyspark.sql.functions import *&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;sc.addFile(url)&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;But according to the runtime 14 release notes (&lt;A href="https://learn.microsoft.com/en-gb/azure/databricks/release-notes/runtime/14.0#breaking-changes" target="_blank"&gt;https://learn.microsoft.com/en-gb/azure/databricks/release-notes/runtime/14.0#breaking-changes&lt;/A&gt;), &lt;STRONG&gt;sc&lt;/STRONG&gt; will stop working. IOUtils isn't mentioned there.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The current official way is:&lt;/DIV&gt;&lt;DIV&gt;&lt;A href="https://docs.databricks.com/en/files/download-internet-files.html" target="_blank"&gt;https://docs.databricks.com/en/files/download-internet-files.html&lt;/A&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I hope it helps.&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Fri, 13 Oct 2023 01:49:13 GMT</pubDate>
    <dc:creator>MartinIsti</dc:creator>
    <dc:date>2023-10-13T01:49:13Z</dc:date>
    <item>
      <title>read csv directly from url with pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12053#M6920</link>
      <description>&lt;P&gt;I would like to load a CSV file directly into a Spark DataFrame in Databricks. I tried the following code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv&amp;amp;timezone=Europe/Berlin&amp;amp;lang=fr&amp;amp;use_labels_for_header=true&amp;amp;csv_separator=%3B"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("eco2mix-national-tr.csv"), header=True, inferSchema=True)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;and I got the following error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Path does not exist: dbfs:/local_disk0/spark-c03e8325-0ab6-4c2e-bffb-c9d290283b31/userFiles-a507dd96-cc63-4e47-9b0f-44d2a940bb10/eco2mix-national-tr.csv&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 11:08:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12053#M6920</guid>
      <dc:creator>RantoB</dc:creator>
      <dc:date>2021-10-29T11:08:48Z</dc:date>
    </item>
    <item>
      <title>Re: read csv directly from url with pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12054#M6921</link>
      <description>&lt;P&gt;Check this:&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/57014043/reading-data-from-url-using-spark-databricks-platform" alt="https://stackoverflow.com/questions/57014043/reading-data-from-url-using-spark-databricks-platform" target="_blank"&gt;https://stackoverflow.com/questions/57014043/reading-data-from-url-using-spark-databricks-platform&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Basically, add "file://" as a prefix to your path.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 11:27:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12054#M6921</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-29T11:27:02Z</dc:date>
    </item>
    <item>
      <title>Re: read csv directly from url with pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12055#M6922</link>
      <description>&lt;P&gt;I had already read that post and tried it, but it did not work either:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Path does not exist: file:/local_disk0/spark-48fd5772-d1a9-40f2-a2f2-bcad38962ed6/userFiles-0298f7e7-105c-4c8d-a845-0975edd378a0/eco2mix-national-tr.csv&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 29 Oct 2021 11:45:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12055#M6922</guid>
      <dc:creator>RantoB</dc:creator>
      <dc:date>2021-10-29T11:45:45Z</dc:date>
    </item>
    <item>
      <title>Re: read csv directly from url with pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12056#M6923</link>
      <description>&lt;P&gt;OK, so I tested it myself, and I think I found the issue:&lt;/P&gt;&lt;P&gt;addFile() does not create a file called 'eco2mix-national-tr.csv', but a file called 'download'.&lt;/P&gt;&lt;P&gt;You can check this by using the %sh magic command and then:&lt;/P&gt;&lt;P&gt;ls "/local_disk0/spark-.../userFiles-/"&lt;/P&gt;&lt;P&gt;You will get a list of files: no eco2mix, but a 'download' file.&lt;/P&gt;&lt;P&gt;To see the contents of the download file, you can use the cat command:&lt;/P&gt;&lt;P&gt;%sh&lt;/P&gt;&lt;P&gt;cat "/local_disk0/spark-.../userFiles-.../download"&lt;/P&gt;&lt;P&gt;You will see the contents.&lt;/P&gt;&lt;P&gt;Next, you have to read it with spark.read.csv AND the file:// prefix.&lt;/P&gt;&lt;P&gt;So this makes:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark import SparkFiles

url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv"
# addFile() saves the file as 'download', not 'eco2mix-national-tr.csv'
sc.addFile(url)

# Without the file:// prefix, the path is resolved against dbfs:/ and is not found
path = SparkFiles.get('download')
df = spark.read.csv("file://" + path, header=True, inferSchema=True, sep=";")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This gives:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2354i02DA75FF056CF58E/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When working with local files, it is always a good idea to look at the directory in question and cat the file.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 14:46:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12056#M6923</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-29T14:46:00Z</dc:date>
    </item>
    <item>
      <title>Re: read csv directly from url with pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12057#M6924</link>
      <description>&lt;P&gt;Great, this is working. Thank you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 16:22:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12057#M6924</guid>
      <dc:creator>RantoB</dc:creator>
      <dc:date>2021-10-29T16:22:05Z</dc:date>
    </item>
    <item>
      <title>Re: read csv directly from url with pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12058#M6925</link>
      <description>&lt;P&gt;@Bertrand BURCKER​&amp;nbsp;- If @Werner Stinckens​&amp;nbsp;answered your question, would you mark his as the best answer? That will help others find the solution quickly. &lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 22:01:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12058#M6925</guid>
      <dc:creator>Piper_Wilson</dc:creator>
      <dc:date>2021-10-29T22:01:53Z</dc:date>
    </item>
    <item>
      <title>Re: read csv directly from url with pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12060#M6927</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;You can also use the following:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import org.apache.commons.io.IOUtils // this jar is already available on the Spark cluster
import java.net.URL

val urlfile = new URL("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")

// Download the file on the driver and turn its lines into a Dataset[String].
// toDS() requires spark.implicits._ to be in scope; linesIterator avoids the
// clash with java.lang.String.lines on JDK 11+.
val testDummyCSV = IOUtils.toString(urlfile, "UTF-8").linesIterator.toList.toDS()

// Parse the in-memory lines as CSV
val testcsv = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv(testDummyCSV)

display(testcsv)&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 26 Nov 2021 14:16:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/12060#M6927</guid>
      <dc:creator>User16752246494</dc:creator>
      <dc:date>2021-11-26T14:16:46Z</dc:date>
    </item>
    <item>
      <title>Re: read csv directly from url with pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/49073#M28465</link>
      <description>&lt;P&gt;I know this is a two-year-old thread, but I needed to find a solution to this very thing today. I had one notebook using &lt;STRONG&gt;SparkContext&lt;/STRONG&gt;:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;from pyspark import SparkFiles&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;from pyspark.sql.functions import *&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;sc.addFile(url)&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;But according to the runtime 14 release notes (&lt;A href="https://learn.microsoft.com/en-gb/azure/databricks/release-notes/runtime/14.0#breaking-changes" target="_blank"&gt;https://learn.microsoft.com/en-gb/azure/databricks/release-notes/runtime/14.0#breaking-changes&lt;/A&gt;), &lt;STRONG&gt;sc&lt;/STRONG&gt; will stop working. IOUtils isn't mentioned there.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The current official way is:&lt;/DIV&gt;&lt;DIV&gt;&lt;A href="https://docs.databricks.com/en/files/download-internet-files.html" target="_blank"&gt;https://docs.databricks.com/en/files/download-internet-files.html&lt;/A&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I hope it helps.&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 13 Oct 2023 01:49:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/49073#M28465</guid>
      <dc:creator>MartinIsti</dc:creator>
      <dc:date>2023-10-13T01:49:13Z</dc:date>
    </item>
    <item>
      <title>Re: read csv directly from url with pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/101576#M40729</link>
      <description>&lt;P&gt;Hello, it's the end of 2024 and I still have this issue with Python. As mentioned, the&amp;nbsp;&lt;STRONG&gt;sc&lt;/STRONG&gt; method no longer works. Also, &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/files/" target="_blank" rel="noopener"&gt;working with volumes within "&lt;STRONG&gt;/databricks/driver/&lt;/STRONG&gt;" is not supported in Apache Spark.&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot (342).png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13420iE07BAF6CED31A5AB/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Screenshot (342).png" alt="Screenshot (342).png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;ALTERNATIVE SOLUTION:&lt;/STRONG&gt; Use&amp;nbsp;&lt;STRONG&gt;requests&lt;/STRONG&gt; to download the file from the URL and save it to a&amp;nbsp;&lt;STRONG&gt;DBFS path&lt;/STRONG&gt; under "/FileStore/", which is accessible from Databricks.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import requests

url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv&amp;amp;timezone=Europe/Berlin&amp;amp;lang=fr&amp;amp;use_labels_for_header=true&amp;amp;csv_separator=%3B"

local_path = "/FileStore/eco2mix-national-tr.csv"

# Use requests to download the file; fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()

# Write through the /dbfs mount so the file lands in DBFS
with open("/dbfs" + local_path, "wb") as f:
    f.write(response.content)

# Read the CSV from DBFS; the URL requests ';' as the separator (csv_separator=%3B)
df = spark.read.csv(
    path=local_path,
    header=True,
    inferSchema=True,
    sep=";"
)

df.show()&lt;/LI-CODE&gt;</description>
      <pubDate>Tue, 10 Dec 2024 09:41:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-csv-directly-from-url-with-pyspark/m-p/101576#M40729</guid>
      <dc:creator>anwangari</dc:creator>
      <dc:date>2024-12-10T09:41:13Z</dc:date>
    </item>
  </channel>
</rss>

