12-06-2021 09:29 AM
When you use:
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
the file is added to the non-DBFS local path /local_disk0/, but when you then try to read it:
spark.read.json(SparkFiles.get("file_name"))
Spark tries to read it from /dbfs/local_disk0/. I also tried file:// and many other creative approaches, and none of them worked.
Of course, it works after using %sh cp to copy the file from /local_disk0/ to /dbfs/local_disk0/.
It seems to be a bug, as if addFile was switched to DBFS on Azure Databricks but SparkFiles was not (in original Spark, addFile puts the file on the workers and SparkFiles gets it from them).
I also couldn't find any setting to manually specify the RootDirectory for SparkFiles.
Labels: Azure, Azure Databricks
01-24-2022 04:42 AM
Hi @Kaniz Fatma, Ticket Number: #00125834.
It's been over a month since the ticket was opened, but still no response.
I have now tested it with Apache Spark 3.2.0 on the Azure platform, and it still fails the same way with the message "File not found". On community.cloud.databricks, however, the path is found and the expected result is returned.
01-24-2022 05:14 AM
municipios = "https://servicodados.ibge.gov.br/api/v1/localidades/municipios"
from pyspark import SparkFiles
spark.sparkContext.addFile(municipios)
municipiosDF = spark.read.option("multiLine", True).option("mode", "OVERRIDE").json("file://"+SparkFiles.get("municipios"))
I did not understand.
@Kaniz Fatma, please change the code above according to your instructions.
Regards,
Welder Martins
01-24-2022 08:55 AM
Hi @Kaniz Fatma (Databricks), it ran without errors. The problem is that SparkFiles doesn't work on the Azure platform. I'm extracting the data from the API by other means; as a workaround I'm even using urllib. RDD will be deprecated as of Apache Spark version 3.0.
Thanks.
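For readers who want to try the urllib workaround mentioned in this reply, a minimal sketch follows. It assumes the endpoint returns a JSON array of objects; the helper names `fetch_bytes` and `parse_records` are hypothetical and not from the thread, and the DataFrame step is left as a comment because it needs a live `spark` session.

```python
import json
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download the raw response body on the driver (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def parse_records(raw):
    """Decode a UTF-8 JSON payload into a list of dicts.

    A single top-level object is wrapped in a list so callers
    always receive the same shape.
    """
    data = json.loads(raw.decode("utf-8"))
    return data if isinstance(data, list) else [data]


# In a notebook, assuming `spark` is the active session and `municipios`
# holds the URL from the earlier post:
# records = parse_records(fetch_bytes(municipios))
# municipiosDF = spark.createDataFrame(records)
```

This downloads everything on the driver, so it only suits payloads that fit in driver memory, and schema inference on nested fields may need an explicit schema.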
01-25-2022 03:39 AM
@Kaniz Fatma hi, do you have access to the tickets that were opened with Databricks? The ticket was opened in December 2021, and so far nobody has given a deadline. Thanks.
03-14-2022 08:20 AM
@Hubert Dudek
Have you tried with file:/// ?
As far as I remember, starting with Spark 3.2 it honors the native Hadoop file system if no file access protocol is defined.
03-14-2022 03:13 PM
Hi, it was a few months ago. I need to check it again with a new DR.
03-28-2022 11:08 AM
I confirm that, as @Arvind Ravish said, adding file:/// solves the problem.
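A minimal sketch of the confirmed fix, for anyone landing here later. It assumes that `SparkFiles.get` returns an absolute local path (such as one under /local_disk0/) and that prefixing it with the file: scheme stops the reader from resolving it against DBFS. The helper names `to_file_uri` and `read_added_json` are hypothetical.

```python
from pathlib import PurePosixPath


def to_file_uri(local_path):
    """Turn an absolute POSIX path into a file:/// URI.

    Raises ValueError for relative paths, which guards against
    accidentally passing a bare file name.
    """
    return PurePosixPath(local_path).as_uri()


def read_added_json(spark, name):
    """Read a file previously registered with sparkContext.addFile.

    The path is looked up through SparkFiles at call time, so nothing
    about the local directory is hard-coded.
    """
    from pyspark import SparkFiles  # lazy import: to_file_uri works without Spark

    uri = to_file_uri(SparkFiles.get(name))
    return spark.read.option("multiLine", True).json(uri)
```

Usage in a notebook would be along the lines of `spark.sparkContext.addFile(municipios)` followed by `municipiosDF = read_added_json(spark, "municipios")`. On a multi-node cluster the file also has to be readable at that same local path on the executors, which this sketch does not verify.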
03-28-2022 11:19 AM
Hey,
But will this allocated address change? It should work the same way it does on community.cloud.databricks. Thanks for the feedback, though.
03-28-2022 11:37 AM