Hi community,
I'm trying to read XML data from Azure Data Lake Storage Gen2 using com.databricks:spark-xml_2.12:0.12.0:
spark.read.format('XML').load('abfss://[CONTAINER]@[storageaccount].dfs.core.windows.net/PATH/TO/FILE.xml')
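For completeness, a fuller version of that read looks like this (rowTag is only an illustrative option; per the stack trace below, the call fails while initializing the ABFS filesystem, before any XML is parsed):

# Same read spelled out; the path placeholders are the same as above and the
# rowTag value is hypothetical.
df = (
    spark.read.format('xml')  # short name registered by com.databricks:spark-xml
    .option('rowTag', 'record')  # hypothetical row tag
    .load('abfss://[CONTAINER]@[storageaccount].dfs.core.windows.net/PATH/TO/FILE.xml')
)
df.printSchema()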
This call throws the following exception:
Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key
Py4JJavaError Traceback (most recent call last)
<command-568646403925120> in <module>
----> 1 spark.read.format('XML').load('abfss://[container]@[storageaccount].dfs.core.windows.net/[PATH]/[FILE].xml')
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
156 self.options(**options)
157 if isinstance(path, str):
--> 158 return self._df(self._jreader.load(path))
159 elif path is not None:
160 if type(path) != list:
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o474.load.
: Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:577)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1832)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:224)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:142)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:537)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:530)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:499)
at org.apache.spark.SparkContext.$anonfun$newAPIHadoopFile$2(SparkContext.scala:1533)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:1066)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1520)
at com.databricks.spark.xml.util.XmlFile$.withCharset(XmlFile.scala:46)
at com.databricks.spark.xml.DefaultSource.$anonfun$createRelation$1(DefaultSource.scala:71)
at com.databricks.spark.xml.XmlRelation.$anonfun$schema$1(XmlRelation.scala:43)
at scala.Option.getOrElse(Option.scala:189)
at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:42)
at com.databricks.spark.xml.XmlRelation$.apply(XmlRelation.scala:29)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:74)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:356)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:323)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:323)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:236)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:750)
Caused by: Invalid configuration value detected for fs.azure.account.key
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.diagnostics.ConfigurationBasicValidator.validate(ConfigurationBasicValidator.java:49)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.diagnostics.Base64StringConfigurationBasicValidator.validate(Base64StringConfigurationBasicValidator.java:40)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.validateStorageAccountKey(SimpleKeyProvider.java:70)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:49)
... 40 more
I'm running this on a non-UC-enabled cluster (No Isolation Shared access mode), Databricks Runtime 10.4 LTS.
In order to connect to ADLS, we've set the following Spark config (following the docs):
spark.databricks.cluster.profile singleNode
spark.master local[*, 4]
fs.azure.account.oauth2.client.id.nubulosdpdlsdev01.dfs.core.windows.net [SPN-ID-HERE]
fs.azure.account.auth.type.nubulosdpdlsdev01.dfs.core.windows.net OAuth
fs.azure.account.oauth2.client.secret.nubulosdpdlsdev01.dfs.core.windows.net [SPN-SECRET-HERE]
fs.azure.account.oauth.provider.type.nubulosdpdlsdev01.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.databricks.delta.preview.enabled true
fs.azure.account.oauth2.client.endpoint.nubulosdpdlsdev01.dfs.core.windows.net https://login.microsoftonline.com/[TENANT-ID-HERE]/oauth2/token
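For reference, the same settings expressed as notebook-level calls would look roughly like this (the secret scope and key names are placeholders, not what we actually use):

# Hypothetical secret scope / key names; the real values come from our service principal.
client_id = dbutils.secrets.get(scope='adls-scope', key='spn-client-id')
client_secret = dbutils.secrets.get(scope='adls-scope', key='spn-client-secret')
tenant_id = dbutils.secrets.get(scope='adls-scope', key='tenant-id')

account = 'nubulosdpdlsdev01.dfs.core.windows.net'
spark.conf.set(f'fs.azure.account.auth.type.{account}', 'OAuth')
spark.conf.set(f'fs.azure.account.oauth.provider.type.{account}',
               'org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider')
spark.conf.set(f'fs.azure.account.oauth2.client.id.{account}', client_id)
spark.conf.set(f'fs.azure.account.oauth2.client.secret.{account}', client_secret)
spark.conf.set(f'fs.azure.account.oauth2.client.endpoint.{account}',
               f'https://login.microsoftonline.com/{tenant_id}/oauth2/token')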
We've chosen to use a service principal for authentication, so we explicitly do not want to use the account key. Reading files from the same location with the delta, text, or csv formats works fine.
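For example, a CSV read against the same container with the exact same cluster config succeeds (the file name and the header option are just placeholders):

# A read that does work with the same OAuth setup.
csv_df = (
    spark.read.format('csv')
    .option('header', 'true')  # illustrative option
    .load('abfss://[CONTAINER]@[storageaccount].dfs.core.windows.net/PATH/TO/FILE.csv')
)
csv_df.show(5)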
My problem seems to be related to this issue.