01-26-2018 02:52 AM
Hi,
I have files hosted on an Azure Data Lake Store, which I can connect to from Azure Databricks configured as per the instructions here.
I can read JSON files fine; however, I get the following error when I try to read an Avro file:
spark.read.format("com.databricks.spark.avro").load("adl://blah.azuredatalakestore.net/blah/blah.avro")
IllegalArgumentException: u'No value for dfs.adls.oauth2.access.token.provider found in conf file.'
I made sure that the file existed by running
dbutils.fs.ls("adl://blah.azuredatalakestore.net/blah/blah.avro")
Please note that the error refers to
dfs.adls.oauth2.access.token.provider
not
dfs.adls.oauth2.access.token.provider.type
which is the key mentioned in the documentation above. Even after I set dfs.adls.oauth2.access.token.provider explicitly, it would still throw the same error.
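For reference, a minimal sketch of how one might set it in a PySpark notebook (the provider class shown is an assumption on my part, taken from the Hadoop ADLS connector, not necessarily the value I tried):
# Sketch: setting the session-level Spark conf (this alone did not fix the error)
spark.conf.set(
    "dfs.adls.oauth2.access.token.provider",
    "org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider")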
Has anyone experienced this issue before? Please let me know what else I should try to further troubleshoot. Thanks.
02-05-2018 01:22 AM
Just found a workaround for the Avro read issue; it seems the proper configuration for dfs.adls.oauth2.access.token.provider is not set up internally. If the ADLS folder is mounted in the Databricks notebook, the read works. Please try the following steps:
1. Mount the ADLS folder:
val configs = Map(
  "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
  "dfs.adls.oauth2.client.id" -> "XXX",
  "dfs.adls.oauth2.credential" -> "YYY",
  "dfs.adls.oauth2.refresh.url" -> "https://login.microsoftonline.com/ZZZ/oauth2/token",
  // Naming the provider class explicitly is what works around the error above
  "dfs.adls.oauth2.access.token.provider" -> "org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider")

dbutils.fs.mount(
  source = "adl://XYZ.azuredatalakestore.net/myfolder/demo/",
  mountPoint = "/mnt/mymount",
  extraConfigs = configs)
2. Verify your file is visible on the mount:
dbutils.fs.ls("dbfs:/mnt/mymount")
3. Read the Avro file through the mount:
import com.databricks.spark.avro._
spark.read.avro("dbfs:/mnt/mymount/mydata.avro").show
I can see the records now.
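For anyone working in PySpark, a rough equivalent of the mount (an untested sketch; the Python dbutils API uses mount_point and extra_configs, and XXX/YYY/ZZZ/XYZ are placeholders as above):
# Sketch: the same mount from Python; credentials and account name are placeholders
configs = {
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": "XXX",
    "dfs.adls.oauth2.credential": "YYY",
    "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/ZZZ/oauth2/token",
    "dfs.adls.oauth2.access.token.provider": "org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider",
}
dbutils.fs.mount(
    source = "adl://XYZ.azuredatalakestore.net/myfolder/demo/",
    mount_point = "/mnt/mymount",
    extra_configs = configs)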
04-15-2018 06:59 PM
Thanks for the workaround.
I had a similar issue unrelated to Avro, but in saving a Spark ML model to ADLS. Even after setting the property manually:
dfs.adls.oauth2.access.token.provider org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider
when setting up the Spark cluster, I would still get the following error when trying to save to ADLS directly:
IllegalArgumentException: u'No value for dfs.adls.oauth2.access.token.provider found in conf file.'
After mounting the ADLS folder, saving works properly.
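For illustration, a sketch of the mount-based save (the model variable and path are hypothetical, and it assumes the mount from the workaround above):
# Sketch: save the ML model through the DBFS mount rather than adl:// directly
model.write().overwrite().save("dbfs:/mnt/mymount/models/my_model")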
03-12-2019 03:28 PM
Hi Michael,
Did you find any other way? I am trying to write TF Records into ADLS and getting the same error even after setting this config.
traindf.repartition(32).write.format('tfrecords').mode('overwrite').option('recordType', 'Example').save("ADLS_URL/my/path")
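In case it helps, a sketch of the same write pointed at the mount instead of the ADLS URL (I haven't verified this for TFRecords; it assumes the folder is mounted as in the workaround above, and the path is illustrative):
# Sketch: write TFRecords to the DBFS mount path instead of adl:// directly
traindf.repartition(32).write.format('tfrecords').mode('overwrite').option('recordType', 'Example').save('dbfs:/mnt/mymount/my/path')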
05-09-2018 01:34 PM
I can also confirm the workaround works, but mounting takes a long time. The main question is why this workaround is needed in the first place. Hopefully Databricks will provide an official response.
02-06-2018 03:10 PM
Any chance you found a solution for this by now?
02-06-2018 03:34 PM
I have not, unfortunately. I can load the Avro file as JSON, although the data comes back corrupted as expected; at least that proves the file is accessible. I don't know what's causing the above error.
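That is, something like this runs without the OAuth error (a sketch; reading Avro bytes with the JSON reader yields garbage rows, it only proves the path is reachable):
# Sketch: reading the Avro file with the JSON reader; output is corrupt but there is no auth error
spark.read.json("adl://blah.azuredatalakestore.net/blah/blah.avro").show()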
02-06-2018 04:49 PM
You may want to try mounting your Data Lake Store to DBFS and accessing your files through the mounted path.
I have not tried it myself, but you might find the following thread helpful.
02-12-2018 03:18 PM
Any solutions for this? I can read CSV files but not GeoJSON files because I am getting this exception.
10-27-2018 11:29 PM
I am getting the same error for CSV. Did you solve it?
04-05-2018 01:29 AM
I had the same issue when using dynamic partitioning in ADLS with Databricks Spark SQL.
You need to pass the ADLS configs as Spark configs during cluster creation:
dfs.adls.oauth2.client.id ***
dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/**/oauth2/token
dfs.adls.oauth2.credential **
dfs.adls.oauth2.access.token.provider.type ClientCredential
dfs.adls.oauth2.access.token.provider org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider
You also need to set the Hadoop configuration for RDD-related functionality:
spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.access.token.provider.type", spark.conf.get("dfs.adls.oauth2.access.token.provider.type"))
spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.client.id", spark.conf.get("dfs.adls.oauth2.client.id"))
spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.credential", spark.conf.get("dfs.adls.oauth2.credential"))
spark.sparkContext.hadoopConfiguration.set("dfs.adls.oauth2.refresh.url", spark.conf.get("dfs.adls.oauth2.refresh.url"))
Those two measures fixed the issue for me.
/Taras
10-18-2018 06:32 AM
Like Taras said, after adding the spark.sparkContext.hadoopConfiguration.set calls, there is no need to mount the ADLS folder.
06-11-2018 03:46 PM
Taras's answer is correct. Because spark-avro is based on the RDD APIs, the properties must be set in the hadoopConfiguration options.
Please note these docs for configuration using the RDD API: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake.html#access-azure-da...
In Python, you can use sc._jsc.hadoopConfiguration().set() to set the same keys.
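For example, a sketch of the Python equivalent of Taras's Scala calls above (it assumes the values are already present in the session conf, as in his post):
# Sketch: copy the ADLS OAuth settings into the Hadoop configuration from Python
hc = sc._jsc.hadoopConfiguration()
for key in [
        "dfs.adls.oauth2.access.token.provider.type",
        "dfs.adls.oauth2.client.id",
        "dfs.adls.oauth2.credential",
        "dfs.adls.oauth2.refresh.url"]:
    hc.set(key, spark.conf.get(key))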