SparkFiles - strange behavior on Azure databricks (runtime 10)

Hubert-Dudek
Esteemed Contributor III

When you use:

from pyspark import SparkFiles
spark.sparkContext.addFile(url)

it adds the file to the non-DBFS local path /local_disk0/, but when you then try to read the file:

spark.read.json(SparkFiles.get("file_name"))

it tries to read it from /dbfs/local_disk0/. I also tried file:// and many other creative ways, and nothing works.

Of course it works after using %sh cp to copy the file from /local_disk0/ to /dbfs/local_disk0/.

It seems to be a bug, as if addFile was switched to DBFS on Azure Databricks but SparkFiles was not (in vanilla Spark, addFile distributes the file to the workers and SparkFiles gets it from them).

I also couldn't find any setting to manually specify the root directory for SparkFiles.
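Until the path mismatch is fixed, the %sh cp workaround described above can also be scripted in Python. This is only a sketch, assuming /dbfs/tmp is writable on the cluster; the helper names dbfs_target and fetch_to_dbfs are hypothetical, not a Databricks API:

```python
import os
import shutil


def dbfs_target(local_path, dbfs_dir="/dbfs/tmp"):
    # Map the driver-local path returned by SparkFiles.get()
    # to a DBFS-backed path that spark.read can resolve.
    return os.path.join(dbfs_dir, os.path.basename(local_path))


def fetch_to_dbfs(spark, url, file_name):
    # Download with addFile, then copy from the driver-local disk into DBFS.
    from pyspark import SparkFiles

    spark.sparkContext.addFile(url)
    local_path = SparkFiles.get(file_name)   # e.g. /local_disk0/.../file_name
    target = dbfs_target(local_path)         # e.g. /dbfs/tmp/file_name
    shutil.copy(local_path, target)
    return target.replace("/dbfs/", "dbfs:/", 1)  # Spark-API form of the path
```

In a Databricks notebook this would then be used as `df = spark.read.json(fetch_to_dbfs(spark, url, "file_name"))`.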

29 REPLIES

weldermartins
Honored Contributor

Hello everyone, any news?

Thanks.

@Kaniz Fatma

Hi @welder martins, if you've already created a support ticket, could you please share the ticket details?

weldermartins
Honored Contributor

Hi @Kaniz Fatma, Ticket Number: #00125834.

It's been over a month since the ticket was opened, but still no response.

I tested it again just now with Apache Spark 3.2.0 on the Azure platform, and it fails the same way with the message "File not found". But on community.cloud.databricks the path is found and the expected result is returned.

Hi @welder martins, please try adding /dbfs/ before /mnt......... at the beginning of the URL.

weldermartins
Honored Contributor
municipios = "https://servicodados.ibge.gov.br/api/v1/localidades/municipios"
from pyspark import SparkFiles
spark.sparkContext.addFile(municipios)
 
municipiosDF = spark.read.option("multiLine", True).option("mode", "OVERRIDE").json("file://"+SparkFiles.get("municipios"))

I did not understand.

Could you change the code above as you instructed? @Kaniz Fatma

Regards,

Welder Martins

Hi @welder martins, you can also read your JSON file's URL with this method.

from pyspark.sql import SparkSession, functions as F
from urllib.request import urlopen
 
spark = SparkSession.builder.getOrCreate()
 
url = 'https://servicodados.ibge.gov.br/api/v1/localidades/municipios'
jsonData = urlopen(url).read().decode('utf-8')
rdd = spark.sparkContext.parallelize([jsonData])
df = spark.read.json(rdd)
display(df)

Kaniz
Community Manager

Hi @welder martins, you can also try this. Attached is the screenshot too. Please let me know if that doesn't work.

import requests
from pyspark.sql import *

response = requests.get('https://servicodados.ibge.gov.br/api/v1/localidades/municipios')
jsondata = response.json()
df = spark.read.option("multiline", "true").json(sc.parallelize([jsondata]))
df.show()

Screenshot 2022-01-24 at 9.44.28 PM

weldermartins
Honored Contributor

Hi @Kaniz Fatma (Databricks), it ran without errors. The problem remains that SparkFiles doesn't work on the Azure platform. For now I'm extracting the data from the API another way, using urllib as a stopgap. And the RDD API is being de-emphasized as of Apache Spark 3.0.

Thanks.
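Given the concern above about relying on RDDs, the download-and-parse step can stay in plain Python and the result handed to spark.createDataFrame, which avoids constructing an RDD explicitly. A sketch only; parse_records and fetch_records are hypothetical helper names, not part of any library:

```python
import json
from urllib.request import urlopen


def parse_records(payload):
    # The IBGE endpoint returns a top-level JSON array; parse it
    # into the list of dicts that spark.createDataFrame accepts.
    records = json.loads(payload)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    return records


def fetch_records(url):
    # Download the raw JSON text and parse it on the driver.
    with urlopen(url) as resp:
        return parse_records(resp.read().decode("utf-8"))


# In a Databricks notebook (where spark is predefined):
# records = fetch_records("https://servicodados.ibge.gov.br/api/v1/localidades/municipios")
# df = spark.createDataFrame(records)   # DataFrame API, no explicit RDD
```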

weldermartins
Honored Contributor

@Kaniz Fatma hi, do you have access to the tickets that were opened with Databricks? The ticket was opened in December 2021 and so far they have not commented on a timeline. Thanks.

Hi @welder martins, we're working on it. Please give us some time.

User16764241763
Honored Contributor

@Hubert Dudek

Have you tried file:///?

As I recall, starting with Spark 3.2, Spark honors the native Hadoop file system if no file access protocol is specified.

Hubert-Dudek
Esteemed Contributor III

Hi, that was a few months ago. I need to check it again with the new DBR.

Hubert-Dudek
Esteemed Contributor III

I confirm that, as @Arvind Ravish said, adding file:/// solves the problem.

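The attached screenshot isn't reproduced here, but the confirmed pattern is presumably along these lines: since SparkFiles.get() returns an absolute path, prefixing it with file:// yields a file:/// URI, which pins the read to the local file system instead of DBFS. The to_local_uri helper is just an illustration:

```python
def to_local_uri(path):
    # SparkFiles.get() returns an absolute path, so "file://" + path
    # produces the file:/// form that forces local-file-system access.
    return "file://" + path


# In a Databricks notebook:
# from pyspark import SparkFiles
# url = "https://servicodados.ibge.gov.br/api/v1/localidades/municipios"
# spark.sparkContext.addFile(url)
# df = spark.read.option("multiLine", True).json(to_local_uri(SparkFiles.get("municipios")))
```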

Hey,

But will this allocated address change between runs? It should work the way it does on Community Edition. But thanks for the feedback.

Hubert-Dudek
Esteemed Contributor III

Polished the syntax a bit:
