Azure Data Lake Config Issue: No value for dfs.adls.oauth2.access.token.provider found in conf file.

microamp
New Contributor II

Hi,

I have files hosted on an Azure Data Lake Store which I can connect to from Azure Databricks, configured as per the instructions here.
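For reference, the configuration follows the pattern in those instructions, roughly like this (a sketch; the placeholder values are mine):

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "<application-id>")
spark.conf.set("dfs.adls.oauth2.credential", "<client-secret>")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")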

I can read JSON files fine; however, I'm getting the following error when I try to read an Avro file.

spark.read.format("com.databricks.spark.avro").load("adl://blah.azuredatalakestore.net/blah/blah.avro")
IllegalArgumentException: u'No value for dfs.adls.oauth2.access.token.provider found in conf file.'

I made sure that the file existed by running

dbutils.fs.ls("adl://blah.azuredatalakestore.net/blah/blah.avro")

Please note that the error refers to

dfs.adls.oauth2.access.token.provider

not

dfs.adls.oauth2.access.token.provider.type

which is mentioned in the documentation above. Even after I set it to something, it would still throw the same error.

Has anyone experienced this issue before? Please let me know what else I should try to further troubleshoot. Thanks.

12 REPLIES

AshitabhKumar
New Contributor II

I just found a workaround for the Avro read issue; it seems the dfs.adls.oauth2.access.token.provider configuration is not being picked up properly. If the ADLS folder is mounted in the Databricks notebook, the read works. Please try the following steps.

1. Mount adl folder

val configs = Map(
  "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
  "dfs.adls.oauth2.client.id" -> "XXX",
  "dfs.adls.oauth2.credential" -> "YYY",
  "dfs.adls.oauth2.refresh.url" -> "https://login.microsoftonline.com/ZZZ/oauth2/token",
  "dfs.adls.oauth2.access.token.provider"->"org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider") 
dbutils.fs.mount(
  source = "adl://XYZ.azuredatalakestore.net/myfolder/demo/",
  mountPoint = "/mnt/mymount",
  extraConfigs = configs)

2. Verify your file is visible on the mount

dbutils.fs.ls("dbfs:/mnt/mymount")
import com.databricks.spark.avro._

spark.read.avro("dbfs:/mnt/mymount/mydata.avro").show

I can see the records now.

Thanks for the workaround.

I had a similar issue, unrelated to Avro, when saving a Spark ML model to ADLS. Even setting the property manually:

dfs.adls.oauth2.access.token.provider org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider

when setting up the Spark cluster would still result in this error message when trying to save to ADLS directly:

IllegalArgumentException: u'No value for dfs.adls.oauth2.access.token.provider found in conf file.'

After mounting the ADLS folder, saving works properly.
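For example, with the mount in place, a save along these lines works (a rough sketch; model stands in for whatever fitted Spark ML model is being written, and the mounted path is just an illustration):

# write the fitted model to the mounted ADLS path instead of the adl:// URL
model.write().overwrite().save("dbfs:/mnt/mymount/models/my_model")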

Hi Michael,

Did you find any other way? I am trying to write TF Records to ADLS and am getting the same error even after setting this config.

traindf.repartition(32).write.format('tfrecords').mode('overwrite').option('recordType', 'Example').save("ADLS_URL/my/path")

I can also confirm that the workaround works, but it takes a long time to mount. The main question is why this workaround is needed in the first place; hopefully Databricks will provide an official response.

adina
New Contributor II

Any chance you found a solution for this by now?

microamp
New Contributor II

Unfortunately, I have not. I can load the Avro file as JSON (the data comes back corrupted, as expected), but at least that proves the file is accessible. I don't know what's causing the above error.

microamp
New Contributor II

You may want to try mounting your Data Lake Store to DBFS and accessing your files through the mounted path.

https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html#mounting-azure-data-...

I have not tried it myself yet, but you might find the following thread helpful.

https://forums.databricks.com/questions/13266/azure-db-mount-on-python-unexpected-keyword-argume.htm...
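Untested on my side, but if the problem in that thread is the keyword names, a Python mount would look roughly like this (a sketch; in Python the arguments are mount_point and extra_configs, and the placeholder values are the same as in the Scala example above):

configs = {
  "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
  "dfs.adls.oauth2.client.id": "XXX",
  "dfs.adls.oauth2.credential": "YYY",
  "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/ZZZ/oauth2/token",
  "dfs.adls.oauth2.access.token.provider": "org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider"}

dbutils.fs.mount(
  source = "adl://XYZ.azuredatalakestore.net/myfolder/demo/",
  mount_point = "/mnt/mymount",
  extra_configs = configs)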

PirrALuis_Simoe
New Contributor II

Any solutions for this? I can read CSV files, but not GeoJSON files, because I am getting this exception.

I am getting the same error for CSV. Did you solve it?

TarasChaikovsky
New Contributor II

I had the same issue when using dynamic partitioning in ADLS with Databricks Spark SQL.

You need to pass ADLS configs as Spark configs during cluster creation:

dfs.adls.oauth2.client.id ***
dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/**/oauth2/token
dfs.adls.oauth2.credential **
dfs.adls.oauth2.access.token.provider.type ClientCredential
dfs.adls.oauth2.access.token.provider org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider

You also need to set the hadoopConfiguration for RDD-related functionality:

spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.access.token.provider.type", spark.conf.get("dfs.adls.oauth2.access.token.provider.type"))
spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.client.id", spark.conf.get("dfs.adls.oauth2.client.id"))
spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.credential", spark.conf.get("dfs.adls.oauth2.credential"))
spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.refresh.url", spark.conf.get("dfs.adls.oauth2.refresh.url"))
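With both in place, the direct adl:// read from the original question should also work without a mount, e.g. in Python (a sketch reusing the placeholder path from the question):

spark.read.format("com.databricks.spark.avro").load("adl://blah.azuredatalakestore.net/blah/blah.avro")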

Those two measures fixed the issue for me.

/Taras

Like Taras said, after adding the spark.sparkContext.hadoopConfiguration.set calls, there is no need to mount the ADLS folder.

User16301467523
New Contributor II

Taras's answer is correct. Because spark-avro is based on the RDD APIs, the properties must be set in the hadoopConfiguration options.

Please note these docs for configuration using the RDD API: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake.html#access-azure-da...

In Python, you can use

sc._jsc.hadoopConfiguration().set()
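For example, roughly mirroring Taras's Scala snippet above (a sketch; it assumes the dfs.adls.oauth2.* values are already set in the cluster's Spark config):

# Copy the ADLS OAuth properties from the Spark conf into the Hadoop configuration
# so that RDD-based code paths such as spark-avro can see them.
hc = sc._jsc.hadoopConfiguration()
for key in ("dfs.adls.oauth2.access.token.provider.type",
            "dfs.adls.oauth2.client.id",
            "dfs.adls.oauth2.credential",
            "dfs.adls.oauth2.refresh.url"):
    hc.set(key, spark.conf.get(key))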
