SparkFiles - strange behavior on Azure databricks (runtime 10)

Hubert-Dudek
Esteemed Contributor III

When you use:

from pyspark import SparkFiles
spark.sparkContext.addFile(url)

it adds the file to the non-DBFS path /local_disk0/, but then when you want to read the file:

spark.read.json(SparkFiles.get("file_name"))

it tries to read it from /dbfs/local_disk0/. I also tried file:// and many other creative approaches, and none of them work.

Of course, it works after using %sh cp to copy the file from /local_disk0/ to /dbfs/local_disk0/.
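The copy workaround can be sketched in plain Python as well. The cluster paths /local_disk0/ and the /dbfs/ FUSE mount are stood in for by temporary directories here, since this snippet runs outside Databricks:

```python
import os
import pathlib
import shutil
import tempfile

# Stand-ins for the cluster paths: src_dir plays the role of /local_disk0/
# (where addFile downloads to), dst_dir the /dbfs/ FUSE mount that the
# %sh cp workaround copies into.
src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()

src = os.path.join(src_dir, "file_name.json")
pathlib.Path(src).write_text('{"a": 1}')

# Equivalent of: %sh cp /local_disk0/.../file_name.json /dbfs/local_disk0/
dst = shutil.copy(src, os.path.join(dst_dir, "file_name.json"))
print(os.path.exists(dst))  # True
```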

It seems to be a bug: it looks as if addFile was switched to DBFS on Azure Databricks, but SparkFiles was not (in vanilla Spark, addFile distributes the file to the workers and SparkFiles.get retrieves it from them).

I also couldn't find any setting to manually specify the root directory for SparkFiles.
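To make the mismatch concrete, here is a sketch in plain Python. The path below is an illustrative example of the kind of driver-local path SparkFiles.get() returns; the Spark calls themselves only run on a cluster:

```python
# Illustrative driver-local path of the kind SparkFiles.get() returns.
local_path = "/local_disk0/spark-abc/userFiles-xyz/file_name.json"

# A scheme-less path is resolved against the cluster's default file system,
# which on Databricks is dbfs:/ -- so Spark looks in the wrong place:
resolved_by_spark = "dbfs:" + local_path

# An explicit file:// URI forces the local file system instead:
explicit_uri = "file://" + local_path

print(resolved_by_spark)  # dbfs:/local_disk0/spark-abc/userFiles-xyz/file_name.json
print(explicit_uri)       # file:///local_disk0/spark-abc/userFiles-xyz/file_name.json
```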

1 ACCEPTED SOLUTION

Accepted Solutions

User16764241763
Honored Contributor

@Hubert Dudek​ 

Have you tried with file:///?

I recall that starting with Spark 3.2, it honors the native Hadoop file system if no file access protocol is specified.
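A minimal sketch of the suggested fix. The helper name to_local_uri is hypothetical, and the commented Spark lines assume a Databricks cluster:

```python
def to_local_uri(path: str) -> str:
    """Prefix a driver-local path with the file:// scheme so that Spark
    reads it from the local file system rather than dbfs:/."""
    return "file://" + path if path.startswith("/") else "file:///" + path

# Hypothetical usage on a cluster (not runnable outside one):
# spark.sparkContext.addFile(url)
# df = spark.read.json(to_local_uri(SparkFiles.get("file_name.json")))

print(to_local_uri("/local_disk0/spark-abc/file_name.json"))
# file:///local_disk0/spark-abc/file_name.json
```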



weldermartins
Honored Contributor

Hello.

I'm in the same situation. Data extraction via an API using SparkFiles runs without error on Databricks Community Edition, but on Azure it produces the error mentioned above.

jorgeff
New Contributor II

It generates the mentioned error in Azure for me too.

Hubert-Dudek
Esteemed Contributor III

@Kaniz Fatma​ @Piper Wilson​ can you help escalate this issue, as more people are complaining about it?

Marcos_Gois
New Contributor II

Hello everyone

This problem is happening to me too, in Azure. It would be great if somebody could help us.

Anonymous
Not applicable

@Hubert Dudek​ - You got it!

weldermartins
Honored Contributor

Hi, I'm new here and I have a question: will the bug fix only be prioritized if there are votes, comments, and views?

Hubert-Dudek
Esteemed Contributor III

Someone should get back to us.

weldermartins
Honored Contributor

@Kaniz Fatma (Databricks) @Piper (Customer) 

Hi, how are you?

Is there a solution for this problem yet?

Hubert-Dudek
Esteemed Contributor III

@Prabakar Ammeappin​ @Werner Stinckens​ @Jose Gonzalez​ maybe you could look into this issue as well 🙂

Anonymous
Not applicable

@Hubert Dudek​, @Dev John​, @Marcos Gois​, @Jorge Fernandes​, and @welder martins​ - Are you able to open a support ticket here - https://help.databricks.com/s/contact-us?

weldermartins
Honored Contributor

OK, it's a pity it can't be solved here, but the ticket has been opened. I'll share news of its progress.

thanks.

Anonymous
Not applicable

@welder martins​ - Thank you for opening the ticket. We want to cover all our bases.

Hubert-Dudek
Esteemed Contributor III

Yes, this solution was already discussed on Stack Overflow. The problem is that this Spark functionality should be adjusted in DBR so that everything is handled automatically via DBFS. It seems it was partly adjusted, but not fully.

Hi @Hubert Dudek​ , We'll get back to you soon.
