cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Forum Posts

Bbren
by New Contributor
  • 2419 Views
  • 2 replies
  • 1 kudos

Resolved! Handling of millions of xml in json files

Hi all, i have some questions related to the handling of many smalls files and possible improvements and augmentations. We have many small xml files. These files are previously processed by another system that puts them in our datalake, but as an add...

  • 2419 Views
  • 2 replies
  • 1 kudos
Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Bauke Brenninkmeijer​ Thank you for posting your question in our community! We are happy to assist you.To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ...

  • 1 kudos
1 More Replies
CDICSteph
by New Contributor
  • 1985 Views
  • 2 replies
  • 0 kudos

Need pattern for loading a million small XML files

Hi, looking for the right solution pattern for this scenario: We have millions of relatively small XML files (currently sitting in ADLS) that we have to load into delta lake. Each XML file has to be read, parsed, and pivoted before writing to a delta...

  • 1985 Views
  • 2 replies
  • 0 kudos
Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Steph Swierenga​ Thank you for posting your question in our community! We are happy to assist you.To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...

  • 0 kudos
1 More Replies
Himanshu1
by New Contributor II
  • 2080 Views
  • 1 replies
  • 3 kudos

How to read XML files in delta live tables?

Even after maven library installation using the Auto installation.spark.read.option("rowTag", "tag").xml("dbfs:/mnt/dev/bronze/xml/fileName.xml")not working.

image.png
  • 2080 Views
  • 1 replies
  • 3 kudos
Latest Reply
DD_Sharma
New Contributor III
  • 3 kudos

At present DLT does not support installing the maven library from the DLT pipeline. In the future this feature will come for sure so please wait for some time and keep checking data bricks runtime release docs https://docs.databricks.com/release-note...

  • 3 kudos
oleole
by Contributor
  • 6996 Views
  • 3 replies
  • 2 kudos

Resolved! Using "FOR XML PATH" in Spark SQL in sql syntax

I'm using spark version 3.2.1 on databricks (DBR 10.4 LTS), and I'm trying to convert sql server sql query to a new sql query that runs on a spark cluster using spark sql in sql syntax. However, spark sql does not seem to support XML PATH as a functi...

input output
  • 6996 Views
  • 3 replies
  • 2 kudos
Latest Reply
oleole
Contributor
  • 2 kudos

Posting the solution that I ended up using:%sql DROP TABLE if exists UserCountry; CREATE TABLE if not exists UserCountry ( UserID INT, Country VARCHAR(5000) ); INSERT INTO UserCountry SELECT L.UserID AS UserID, CONCAT_WS(',', co...

  • 2 kudos
2 More Replies
powerus
by New Contributor III
  • 4575 Views
  • 1 replies
  • 0 kudos

Resolved! "Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key" using com.databricks:spark-xml_2.12:0.12.0

Hi community,I'm trying to read XML data from Azure Datalake Gen 2 using com.databricks:spark-xml_2.12:0.12.0:spark.read.format('XML').load('abfss://[CONTAINER]@[storageaccount].dfs.core.windows.net/PATH/TO/FILE.xml')The code above gives the followin...

  • 4575 Views
  • 1 replies
  • 0 kudos
Latest Reply
powerus
New Contributor III
  • 0 kudos

The issue was also raised here: https://github.com/databricks/spark-xml/issues/591A fix is to use the "spark.hadoop" prefix in front of the fs.azure spark config keys:spark.hadoop.fs.azure.account.oauth2.client.id.nubulosdpdlsdev01.dfs.core.windows.n...

  • 0 kudos
rammy
by Contributor III
  • 2472 Views
  • 3 replies
  • 11 kudos

How would i retrieve data JSON data with namespaces using spark SQL?

File.json from the below code contains huge JSON data with each key containing namespace prefix(This JSON file converted from the XML file).I could able to retrieve if JSON does not contain namespaces but what could be the approach to retrieve record...

image.png image
  • 2472 Views
  • 3 replies
  • 11 kudos
Latest Reply
SS2
Valued Contributor
  • 11 kudos

I case of struct you can use (.) For extracting the value

  • 11 kudos
2 More Replies
Stita
by New Contributor II
  • 2744 Views
  • 3 replies
  • 3 kudos

Resolved! How do we pass the row tags dynamically while reading a XML file into a dataframe?

I have a set of xml files where the row tags change dynamically. How can we achieve this scenario in databricks.df1=spark.read.format('xml').option('rootTag','XRoot').option('rowTag','PL1PLLL').load("dbfs:/FileStore/tables/ins/")We need to pass a val...

  • 2744 Views
  • 3 replies
  • 3 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

If it is dynamically for the whole file, you can just use variabletag = 'PL1PLLL' df1=spark.read.format('xml').option('rootTag','XRoot').option('rowTag' ,tag).load("dbfs:/FileStore/tables/ins/file.xml")

  • 3 kudos
2 More Replies
db-avengers2rul
by Contributor II
  • 25802 Views
  • 1 replies
  • 0 kudos

Resolved! Max file size allowed to import into Databricks community edition ?

Hi All,I have few questions using the community edition 1) max file size that is allowed to be uploaded (data file) in community edition ?2) is XML file supported as well ? Regards,Rakesh

  • 25802 Views
  • 1 replies
  • 0 kudos
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Rakesh Reddy Gopidi​, Hope this thread helps answer your first question.

  • 0 kudos
PriyaTech
by New Contributor
  • 3107 Views
  • 2 replies
  • 2 kudos

Resolved! Converting Dataframe into Nested xml

e.g.dataframe is having firstname,lastname,middlename,id,salaryI need to convert dataframe in xml file but in nested format.output as nested xml<Name>    <firatname> <middlename>    <lastname>    </Name><id></id><salary></salary>Anyone has ides ho...

  • 3107 Views
  • 2 replies
  • 2 kudos
Latest Reply
Kaniz_Fatma
Community Manager
  • 2 kudos

Hi @Priyanka Jain​, We haven't heard from you on the last response from @Werner Stinckens​​, and I was checking back to see if his suggestions helped you. Or else, If you have any solution, please share it with the community as it can be helpful to o...

  • 2 kudos
1 More Replies
Sha_1890
by New Contributor III
  • 5905 Views
  • 5 replies
  • 3 kudos

java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class Error on writing a dataframe to a SQL DB in azure

I have to write the extracted data from XML to DB , i am using Dataframe for transformation and trying to load that to DB.I have installed these libraries,com.databricks:spark-xml_2.12:0.15.0com.microsoft.azure:spark-mssql-connector_2.11_2.4:1.0.2and...

  • 5905 Views
  • 5 replies
  • 3 kudos
Latest Reply
Vidula
Honored Contributor
  • 3 kudos

Hi @shafana Roohi Jahubar​ Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from ...

  • 3 kudos
4 More Replies
Paramesh
by New Contributor II
  • 3304 Views
  • 4 replies
  • 2 kudos

Resolved! How to read multiple tiny XML files in parallel

Hi team, we are trying to read multiple tiny XML files, able to parse them using the data bricks XML jar, but is there any way to read these files in parallel and distribute the load across the cluster? right now our job is taking 90% of the time rea...

  • 3304 Views
  • 4 replies
  • 2 kudos
Latest Reply
Paramesh
New Contributor II
  • 2 kudos

Thank you @Hubert Dudek​ for the suggestion. Similar to your recommendation, we added a step in our pipeline to merge the small files to large files and make them available for the spark job.

  • 2 kudos
3 More Replies
Michael_Galli
by Contributor III
  • 3078 Views
  • 4 replies
  • 2 kudos

Resolved! Unittest in PySpark - how to read XML with Maven com.databricks.spark.xml ?

When writing unit tests with unittest / pytest in PySpark, reading mockup datasources with built-in datatypes like csv, json (spark.read.format("json")) works just fine.But when reading XML´s with spark.read.format("com.databricks.spark.xml") in the ...

  • 3078 Views
  • 4 replies
  • 2 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

Please install spark-xml from Maven. As it is from Maven you need to install it for cluster which you are using in cluster settings (alternatively using API or CLI)https://mvnrepository.com/artifact/com.databricks/spark-xml

  • 2 kudos
3 More Replies
wyzer
by Contributor II
  • 4103 Views
  • 9 replies
  • 4 kudos

Unable to read an XML file of 9 GB

Hello,We have a large XML file (9 GB) that we can't read.We have this error : VM size limitBut how can we change the VM size limit ?We have tested many clusters, but no one can read this file.Thank you for your help.

  • 4103 Views
  • 9 replies
  • 4 kudos
Latest Reply
jose_gonzalez
Moderator
  • 4 kudos

Hi @Salah K.​,Just a friendly follow-up. Did any of the responses help you to resolve your question? if it did, please mark it as best. Otherwise, please let us know if you still need help.

  • 4 kudos
8 More Replies
Labels