01-24-2023 01:12 AM
Hi community,
I'm trying to read XML data from Azure Datalake Gen 2 using com.databricks:spark-xml_2.12:0.12.0:
spark.read.format('XML').load('abfss://[CONTAINER]@[storageaccount].dfs.core.windows.net/PATH/TO/FILE.xml')
The code above gives the following exception:
Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key
Py4JJavaError Traceback (most recent call last)
<command-568646403925120> in <module>
----> 1 spark.read.format('XML').load('abfss://[container]@[storageaccount].dfs.core.windows.net/[PATH]/[FILE].xml')
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
156 self.options(**options)
157 if isinstance(path, str):
--> 158 return self._df(self._jreader.load(path))
159 elif path is not None:
160 if type(path) != list:
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o474.load.
: Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:577)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1832)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:224)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:142)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:537)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:530)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:499)
at org.apache.spark.SparkContext.$anonfun$newAPIHadoopFile$2(SparkContext.scala:1533)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:1066)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1520)
at com.databricks.spark.xml.util.XmlFile$.withCharset(XmlFile.scala:46)
at com.databricks.spark.xml.DefaultSource.$anonfun$createRelation$1(DefaultSource.scala:71)
at com.databricks.spark.xml.XmlRelation.$anonfun$schema$1(XmlRelation.scala:43)
at scala.Option.getOrElse(Option.scala:189)
at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:42)
at com.databricks.spark.xml.XmlRelation$.apply(XmlRelation.scala:29)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:74)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:356)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:323)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:323)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:236)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:750)
Caused by: Invalid configuration value detected for fs.azure.account.key
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.diagnostics.ConfigurationBasicValidator.validate(ConfigurationBasicValidator.java:49)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.diagnostics.Base64StringConfigurationBasicValidator.validate(Base64StringConfigurationBasicValidator.java:40)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.validateStorageAccountKey(SimpleKeyProvider.java:70)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:49)
... 40 more
I'm doing this on a non-Unity-Catalog cluster (No Isolation Shared access mode), Databricks Runtime 10.4 LTS.
In order to connect to ADLS, we've set the following Spark config (following the docs):
spark.databricks.cluster.profile singleNode
spark.master local[*, 4]
fs.azure.account.oauth2.client.id.nubulosdpdlsdev01.dfs.core.windows.net [SPN-ID-HERE]
fs.azure.account.auth.type.nubulosdpdlsdev01.dfs.core.windows.net OAuth
fs.azure.account.oauth2.client.secret.nubulosdpdlsdev01.dfs.core.windows.net [SPN-SECRET-HERE]
fs.azure.account.oauth.provider.type.nubulosdpdlsdev01.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.databricks.delta.preview.enabled true
fs.azure.account.oauth2.client.endpoint.nubulosdpdlsdev01.dfs.core.windows.net https://login.microsoftonline.com/[TENANT-ID-HERE]/oauth2/token
We've chosen to use a service principal for authentication, so we explicitly do not want to use the account key. Reading files with the delta, text, or csv formats works fine; only XML fails.
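As a session-level workaround until the cluster config is fixed, the same OAuth keys can be set directly on the underlying Hadoop configuration from the notebook, since spark-xml reads the file through Hadoop's FileSystem API rather than Spark's own file readers. A sketch (the storage account and SPN values are placeholders, not real):

```python
# Sketch of a session-level workaround: push the ABFS OAuth settings straight
# into the Hadoop configuration that spark-xml's newAPIHadoopFile call will see.
# All bracketed values are placeholders.
account = "[storageaccount].dfs.core.windows.net"
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
hconf.set(f"fs.azure.account.oauth.provider.type.{account}",
          "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hconf.set(f"fs.azure.account.oauth2.client.id.{account}", "[SPN-ID-HERE]")
hconf.set(f"fs.azure.account.oauth2.client.secret.{account}", "[SPN-SECRET-HERE]")
hconf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
          "https://login.microsoftonline.com/[TENANT-ID-HERE]/oauth2/token")
```

Note this only affects the current session; the cluster-level Spark config is the cleaner fix.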
My issue seems to be related to this issue.
01-24-2023 04:43 AM
The issue was also raised here: https://github.com/databricks/spark-xml/issues/591
A fix is to add the "spark.hadoop." prefix to the fs.azure.* keys in the cluster's Spark config:
spark.hadoop.fs.azure.account.oauth2.client.id.nubulosdpdlsdev01.dfs.core.windows.net [SPN-ID-HERE]
spark.hadoop.fs.azure.account.auth.type.nubulosdpdlsdev01.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth2.client.secret.nubulosdpdlsdev01.dfs.core.windows.net [SPN-SECRET-HERE]
spark.hadoop.fs.azure.account.oauth.provider.type.nubulosdpdlsdev01.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.databricks.delta.preview.enabled true
spark.hadoop.fs.azure.account.oauth2.client.endpoint.nubulosdpdlsdev01.dfs.core.windows.net https://login.microsoftonline.com/[TENANT-ID-HERE]/oauth2/token
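The reason the prefix matters: when Spark builds the Hadoop Configuration that low-level readers like spark-xml (which goes through sc.newAPIHadoopFile) use, it copies over only the spark.hadoop.*-prefixed entries from the Spark config, stripping the prefix. A rough illustrative sketch of that propagation rule (not Spark's actual code, and the account name below is made up):

```python
def to_hadoop_conf(spark_conf):
    """Illustrative sketch: keep only spark.hadoop.* entries and strip the
    prefix, mimicking how Spark seeds the Hadoop Configuration."""
    prefix = "spark.hadoop."
    return {k[len(prefix):]: v
            for k, v in spark_conf.items() if k.startswith(prefix)}

conf = {
    # with prefix: reaches the Hadoop Configuration
    "spark.hadoop.fs.azure.account.auth.type.myacct.dfs.core.windows.net": "OAuth",
    # without prefix: never reaches it, so ABFS falls back to fs.azure.account.key
    "fs.azure.account.auth.type.otheracct.dfs.core.windows.net": "OAuth",
}
print(to_hadoop_conf(conf))
# → {'fs.azure.account.auth.type.myacct.dfs.core.windows.net': 'OAuth'}
```

This also explains why delta/text/csv worked without the prefix: those readers go through Spark's own file index, which picks up the session-level fs.azure.* settings, while spark-xml hits the raw Hadoop FileSystem and sees none of them.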