01-24-2023 01:12 AM
Hi community,
I'm trying to read XML data from Azure Data Lake Storage Gen2 using com.databricks:spark-xml_2.12:0.12.0:
spark.read.format('XML').load('abfss://[CONTAINER]@[storageaccount].dfs.core.windows.net/PATH/TO/FILE.xml')
The code above gives the following exception:
Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key
Py4JJavaError Traceback (most recent call last)
<command-568646403925120> in <module>
----> 1 spark.read.format('XML').load('abfss://[container]@[storageaccount].dfs.core.windows.net/[PATH]/[FILE].xml')
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
156 self.options(**options)
157 if isinstance(path, str):
--> 158 return self._df(self._jreader.load(path))
159 elif path is not None:
160 if type(path) != list:
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o474.load.
: Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:577)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1832)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:224)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:142)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:537)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:530)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:499)
at org.apache.spark.SparkContext.$anonfun$newAPIHadoopFile$2(SparkContext.scala:1533)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:1066)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1520)
at com.databricks.spark.xml.util.XmlFile$.withCharset(XmlFile.scala:46)
at com.databricks.spark.xml.DefaultSource.$anonfun$createRelation$1(DefaultSource.scala:71)
at com.databricks.spark.xml.XmlRelation.$anonfun$schema$1(XmlRelation.scala:43)
at scala.Option.getOrElse(Option.scala:189)
at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:42)
at com.databricks.spark.xml.XmlRelation$.apply(XmlRelation.scala:29)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:74)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:356)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:323)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:323)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:236)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:750)
Caused by: Invalid configuration value detected for fs.azure.account.key
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.diagnostics.ConfigurationBasicValidator.validate(ConfigurationBasicValidator.java:49)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.diagnostics.Base64StringConfigurationBasicValidator.validate(Base64StringConfigurationBasicValidator.java:40)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.validateStorageAccountKey(SimpleKeyProvider.java:70)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:49)
... 40 more
I'm running this on a non-Unity-Catalog cluster (No Isolation Shared access mode), Databricks Runtime 10.4 LTS.
To connect to ADLS, we've set the following Spark config (following the docs):
spark.databricks.cluster.profile singleNode
spark.master local[*, 4]
fs.azure.account.oauth2.client.id.nubulosdpdlsdev01.dfs.core.windows.net [SPN-ID-HERE]
fs.azure.account.auth.type.nubulosdpdlsdev01.dfs.core.windows.net OAuth
fs.azure.account.oauth2.client.secret.nubulosdpdlsdev01.dfs.core.windows.net [SPN-SECRET-HERE]
fs.azure.account.oauth.provider.type.nubulosdpdlsdev01.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.databricks.delta.preview.enabled true
fs.azure.account.oauth2.client.endpoint.nubulosdpdlsdev01.dfs.core.windows.net https://login.microsoftonline.com/[TENANT-ID-HERE]/oauth2/token
We've chosen to use a service principal for authentication, so we explicitly do not want to use the account key. Reading the same path with the delta, text, or csv formats works fine.
My issue seems to be related to this issue.
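For what it's worth, here is a minimal diagnostic sketch (my assumption, not from the docs) that seems to show the mismatch: the OAuth key is visible in the Spark session config but not in the SparkContext-level Hadoop configuration, which is what spark-xml's newAPIHadoopFile code path (visible in the stack trace) actually reads. The account name is ours; _jsc is PySpark's internal handle to the Java SparkContext:

# Key as set in the cluster Spark config above (no spark.hadoop. prefix)
key = "fs.azure.account.auth.type.nubulosdpdlsdev01.dfs.core.windows.net"

# Visible to the Spark session config (this is why delta/text/csv work):
print(spark.conf.get(key, None))  # -> "OAuth"

# Not visible to the Hadoop configuration used by SparkContext.newAPIHadoopFile:
hconf = spark.sparkContext._jsc.hadoopConfiguration()
print(hconf.get(key))  # -> None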
01-24-2023 04:43 AM
The issue was also raised here: https://github.com/databricks/spark-xml/issues/591
A fix is to add the "spark.hadoop." prefix to the fs.azure Spark config keys. spark-xml reads files through SparkContext.newAPIHadoopFile, which only sees the cluster-level Hadoop configuration, and only Spark config keys prefixed with spark.hadoop. are copied into it:
spark.hadoop.fs.azure.account.oauth2.client.id.nubulosdpdlsdev01.dfs.core.windows.net [SPN-ID-HERE]
spark.hadoop.fs.azure.account.auth.type.nubulosdpdlsdev01.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth2.client.secret.nubulosdpdlsdev01.dfs.core.windows.net [SPN-SECRET-HERE]
spark.hadoop.fs.azure.account.oauth.provider.type.nubulosdpdlsdev01.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.databricks.delta.preview.enabled true
spark.hadoop.fs.azure.account.oauth2.client.endpoint.nubulosdpdlsdev01.dfs.core.windows.net https://login.microsoftonline.com/[TENANT-ID-HERE]/oauth2/token
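If you can't edit the cluster config, the same keys can be set at runtime from a notebook by writing them straight into the SparkContext-level Hadoop configuration. A minimal sketch, assuming the same service principal; the secret scope and key names ("adls", "spn-id", etc.) are made up, so substitute your own:

# Hypothetical secret scope and key names; replace with your own.
client_id = dbutils.secrets.get(scope="adls", key="spn-id")
client_secret = dbutils.secrets.get(scope="adls", key="spn-secret")
tenant_id = dbutils.secrets.get(scope="adls", key="tenant-id")

account = "nubulosdpdlsdev01.dfs.core.windows.net"

# Set the keys on the Hadoop configuration that spark-xml's
# newAPIHadoopFile path reads; spark.conf.set alone is not enough here.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
hconf.set(f"fs.azure.account.oauth.provider.type.{account}",
          "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hconf.set(f"fs.azure.account.oauth2.client.id.{account}", client_id)
hconf.set(f"fs.azure.account.oauth2.client.secret.{account}", client_secret)
hconf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
          f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

df = spark.read.format("xml").load(
    f"abfss://[CONTAINER]@{account}/PATH/TO/FILE.xml")

Note this mutates the shared SparkContext configuration for the whole cluster session, so the cluster-config approach above is the cleaner fix.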