Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Bbren
by New Contributor
  • 3031 Views
  • 2 replies
  • 1 kudos

Resolved! Handling of millions of xml in json files

Hi all, I have some questions about handling many small files and about possible improvements and augmentations. We have many small XML files. These files are previously processed by another system that puts them in our data lake, but as an add...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Bauke Brenninkmeijer, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ...

1 More Replies
CDICSteph
by New Contributor
  • 2400 Views
  • 2 replies
  • 0 kudos

Need pattern for loading a million small XML files

Hi, looking for the right solution pattern for this scenario: We have millions of relatively small XML files (currently sitting in ADLS) that we have to load into delta lake. Each XML file has to be read, parsed, and pivoted before writing to a delta...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Steph Swierenga, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...

1 More Replies
Himanshu1
by New Contributor II
  • 2461 Views
  • 1 replies
  • 3 kudos

How to read XML files in delta live tables?

Even after installing the Maven library using the Auto installation, spark.read.option("rowTag", "tag").xml("dbfs:/mnt/dev/bronze/xml/fileName.xml") is not working.

Latest Reply
DD_Sharma
New Contributor III
  • 3 kudos

At present, DLT does not support installing Maven libraries from a DLT pipeline. This feature should arrive in the future, so please keep checking the Databricks Runtime release docs: https://docs.databricks.com/release-note...

oleole
by Contributor
  • 8514 Views
  • 3 replies
  • 2 kudos

Resolved! Using "FOR XML PATH" in Spark SQL in sql syntax

I'm using Spark version 3.2.1 on Databricks (DBR 10.4 LTS), and I'm trying to convert a SQL Server query into a new query that runs on a Spark cluster using Spark SQL in SQL syntax. However, Spark SQL does not seem to support XML PATH as a functi...

Latest Reply
oleole
Contributor
  • 2 kudos

Posting the solution that I ended up using:
%sql
DROP TABLE if exists UserCountry;
CREATE TABLE if not exists UserCountry ( UserID INT, Country VARCHAR(5000) );
INSERT INTO UserCountry SELECT L.UserID AS UserID, CONCAT_WS(',', co...

2 More Replies
powerus
by New Contributor III
  • 5271 Views
  • 1 replies
  • 0 kudos

Resolved! "Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key" using com.databricks:spark-xml_2.12:0.12.0

Hi community,
I'm trying to read XML data from Azure Data Lake Gen 2 using com.databricks:spark-xml_2.12:0.12.0:
spark.read.format('XML').load('abfss://[CONTAINER]@[storageaccount].dfs.core.windows.net/PATH/TO/FILE.xml')
The code above gives the followin...

Latest Reply
powerus
New Contributor III
  • 0 kudos

The issue was also raised here: https://github.com/databricks/spark-xml/issues/591
A fix is to use the "spark.hadoop" prefix in front of the fs.azure Spark config keys:
spark.hadoop.fs.azure.account.oauth2.client.id.nubulosdpdlsdev01.dfs.core.windows.n...
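The prefixing described in the reply above can be sketched in plain Python. This is only an illustration: the helper name with_hadoop_prefix, the storage account examplestorage, and the placeholder values are assumptions and do not come from the thread.

```python
def with_hadoop_prefix(conf):
    """Return a copy of conf in which every fs.azure.* key gets the spark.hadoop. prefix.

    Keys that do not start with fs.azure. (including keys that are already
    prefixed) are copied unchanged.
    """
    prefixed = {}
    for key, value in conf.items():
        if key.startswith("fs.azure."):
            key = "spark.hadoop." + key
        prefixed[key] = value
    return prefixed


# Hypothetical storage account and placeholder values, for illustration only.
ACCOUNT = "examplestorage.dfs.core.windows.net"
azure_conf = {
    f"fs.azure.account.auth.type.{ACCOUNT}": "OAuth",
    f"fs.azure.account.oauth2.client.id.{ACCOUNT}": "<client-id>",
}

cluster_conf = with_hadoop_prefix(azure_conf)
for key in sorted(cluster_conf):
    print(key)
```

The resulting keys are the form the reply suggests for the Spark config; where exactly they are set (cluster settings or session builder) depends on the environment.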

rammy
by Contributor III
  • 3253 Views
  • 3 replies
  • 11 kudos

How would i retrieve data JSON data with namespaces using spark SQL?

File.json in the code below contains a large amount of JSON data in which each key carries a namespace prefix (this JSON file was converted from an XML file). I am able to retrieve records if the JSON does not contain namespaces, but what could be the approach to retrieve record...

Latest Reply
SS2
Valued Contributor
  • 11 kudos

In case of a struct, you can use dot notation (.) to extract the value.

2 More Replies
Stita
by New Contributor II
  • 3196 Views
  • 1 replies
  • 2 kudos

Resolved! How do we pass the row tags dynamically while reading a XML file into a dataframe?

I have a set of XML files where the row tags change dynamically. How can we achieve this scenario in Databricks?
df1 = spark.read.format('xml').option('rootTag','XRoot').option('rowTag','PL1PLLL').load("dbfs:/FileStore/tables/ins/")
We need to pass a val...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

If it is dynamic for the whole file, you can just use a variable:
tag = 'PL1PLLL'
df1 = spark.read.format('xml').option('rootTag','XRoot').option('rowTag', tag).load("dbfs:/FileStore/tables/ins/file.xml")
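The variable-based approach above can be wrapped in a small function. This is a sketch under assumptions: the helper names read_xml and read_many and the idea of a path-to-tag mapping are illustrative and not from the thread; the reader options mirror the ones shown in the post, and spark is expected to be an existing SparkSession with the XML data source available.

```python
def read_xml(spark, path, row_tag, root_tag="XRoot"):
    """Read one XML path, passing the row tag as an ordinary Python variable."""
    return (
        spark.read.format("xml")
        .option("rootTag", root_tag)
        .option("rowTag", row_tag)
        .load(path)
    )


def read_many(spark, tags_by_path):
    """Read several paths, each with its own row tag.

    tags_by_path is a hypothetical mapping such as
    {"dbfs:/FileStore/tables/ins/file.xml": "PL1PLLL"}.
    Returns a dict of path -> DataFrame.
    """
    return {path: read_xml(spark, path, tag) for path, tag in tags_by_path.items()}
```

How the tag for each file is determined (file name, a lookup table, a first pass over the file) is left open here, since the thread does not say.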

PriyaTech
by New Contributor
  • 3600 Views
  • 1 replies
  • 2 kudos

Resolved! Converting Dataframe into Nested xml

E.g. the dataframe has the columns firstname, lastname, middlename, id, salary. I need to convert the dataframe into an XML file, but in a nested format. Output as nested XML:
<Name>
    <firstname>
    <middlename>
    <lastname>
</Name>
<id></id>
<salary></salary>
Anyone has ideas ho...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

Databricks has an XML connector: https://docs.databricks.com/data/data-sources/xml.html
Basically you just define a df with the correct structure and write it to XML. To create a nested df, here you can find some info.
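To make the target shape concrete, here is a sketch in plain Python using xml.etree.ElementTree. It does not use the Spark XML connector mentioned above; it only shows the name fields grouped under a nested Name element while id and salary stay at the row level. The row tag Employee and the sample record are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Columns that should be grouped under the nested <Name> element.
NAME_FIELDS = ("firstname", "middlename", "lastname")


def record_to_xml(record, row_tag="Employee"):
    """Serialize one flat record as XML with the name fields nested under <Name>."""
    row = ET.Element(row_tag)
    name = ET.SubElement(row, "Name")
    for field in NAME_FIELDS:
        ET.SubElement(name, field).text = str(record[field])
    # The remaining columns stay directly under the row element.
    for field in ("id", "salary"):
        ET.SubElement(row, field).text = str(record[field])
    return ET.tostring(row, encoding="unicode")


# Made-up sample record.
record = {"firstname": "Ada", "middlename": "M", "lastname": "Lee", "id": 1, "salary": 1000}
print(record_to_xml(record))
```

In a dataframe, the same grouping would correspond to a struct column named Name holding the three name fields, which matches the reply's advice to define the df with the correct structure before writing.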

Sha_1890
by New Contributor III
  • 6810 Views
  • 5 replies
  • 3 kudos

java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class Error on writing a dataframe to a SQL DB in azure

I have to write the data extracted from XML to a DB. I am using a DataFrame for the transformation and trying to load that into the DB. I have installed these libraries:
com.databricks:spark-xml_2.12:0.15.0
com.microsoft.azure:spark-mssql-connector_2.11_2.4:1.0.2
and...

Latest Reply
Vidula
Honored Contributor
  • 3 kudos

Hi @shafana Roohi Jahubar, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from ...

4 More Replies
Paramesh
by New Contributor II
  • 3859 Views
  • 3 replies
  • 2 kudos

Resolved! How to read multiple tiny XML files in parallel

Hi team, we are trying to read multiple tiny XML files and are able to parse them using the Databricks XML jar, but is there any way to read these files in parallel and distribute the load across the cluster? Right now our job is taking 90% of the time rea...

Latest Reply
Paramesh
New Contributor II
  • 2 kudos

Thank you @Hubert Dudek for the suggestion. Similar to your recommendation, we added a step in our pipeline to merge the small files into larger files and make them available for the Spark job.
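The thread does not say how the merge step was implemented. One possible approach, sketched in plain Python under assumptions: the function names, the wrapper tag merged, the batch size, and the output file naming are all hypothetical, and each small file is assumed to be a well-formed XML document that fits in memory.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def merge_xml_files(paths, out_path, root_tag="merged"):
    """Wrap the root elements of many small XML files in one document.

    Returns the number of documents written into out_path.
    """
    root = ET.Element(root_tag)
    for path in paths:
        root.append(ET.parse(str(path)).getroot())
    ET.ElementTree(root).write(str(out_path), encoding="utf-8", xml_declaration=True)
    return len(root)


def merge_in_batches(src_dir, out_dir, batch_size=1000):
    """Merge all *.xml files in src_dir into files of at most batch_size documents each."""
    files = sorted(Path(src_dir).glob("*.xml"))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for start in range(0, len(files), batch_size):
        out_path = out_dir / f"batch_{start // batch_size:05d}.xml"
        merge_xml_files(files[start:start + batch_size], out_path)
        outputs.append(out_path)
    return outputs
```

Because each original document becomes a child of the wrapper element, the original root tag could then serve as the row tag when the merged files are read.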

2 More Replies
Michael_Galli
by Contributor III
  • 3617 Views
  • 4 replies
  • 2 kudos

Resolved! Unittest in PySpark - how to read XML with Maven com.databricks.spark.xml ?

When writing unit tests with unittest / pytest in PySpark, reading mockup data sources with built-in formats like csv, json (spark.read.format("json")) works just fine. But when reading XML files with spark.read.format("com.databricks.spark.xml") in the ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

Please install spark-xml from Maven. Because it comes from Maven, you need to install it on the cluster you are using, in the cluster settings (alternatively, using the API or CLI): https://mvnrepository.com/artifact/com.databricks/spark-xml

3 More Replies
wyzer
by Contributor II
  • 4949 Views
  • 8 replies
  • 4 kudos

Unable to read an XML file of 9 GB

Hello,
We have a large XML file (9 GB) that we can't read. We get this error: VM size limit
How can we change the VM size limit? We have tested many clusters, but none of them can read this file.
Thank you for your help.

Latest Reply
jose_gonzalez
Databricks Employee
  • 4 kudos

Hi @Salah K., just a friendly follow-up. Did any of the responses help you to resolve your question? If it did, please mark it as best. Otherwise, please let us know if you still need help.

7 More Replies