Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi all, I have some questions related to the handling of many small files and possible improvements and augmentations. We have many small XML files. These files are previously processed by another system that puts them in our data lake, but as an add...
Hi @Bauke Brenninkmeijer, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ...
Hi, looking for the right solution pattern for this scenario: We have millions of relatively small XML files (currently sitting in ADLS) that we have to load into delta lake. Each XML file has to be read, parsed, and pivoted before writing to a delta...
Hi @Steph Swierenga, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...
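For the millions-of-small-XML-files scenario above, here is a minimal sketch of the usual pattern, assuming a Databricks notebook (where spark is predefined) with the spark-xml library installed; the ADLS paths and the rowTag value are placeholders:

# Read all the small XML files in one pass with a wildcard path so Spark
# parallelizes the work across the cluster.
df = (spark.read.format("xml")
      .option("rowTag", "record")
      .load("abfss://container@account.dfs.core.windows.net/landing/*.xml"))

# ... parse/pivot here ...

# Compact into fewer, larger files when writing to Delta; tune the partition
# count to the cluster rather than inheriting one task per input file.
(df.repartition(64)
   .write.format("delta")
   .mode("append")
   .save("abfss://container@account.dfs.core.windows.net/bronze/records"))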
Even after Maven library installation using the auto installation, spark.read.option("rowTag", "tag").xml("dbfs:/mnt/dev/bronze/xml/fileName.xml") is not working.
At present, DLT does not support installing Maven libraries from the DLT pipeline. This feature is expected in the future, so please keep checking the Databricks Runtime release docs: https://docs.databricks.com/release-note...
I'm using Spark version 3.2.1 on Databricks (DBR 10.4 LTS), and I'm trying to convert a SQL Server SQL query to a new SQL query that runs on a Spark cluster using Spark SQL syntax. However, Spark SQL does not seem to support XML PATH as a functi...
Posting the solution that I ended up using:

%sql
DROP TABLE IF EXISTS UserCountry;

CREATE TABLE IF NOT EXISTS UserCountry (
  UserID INT,
  Country VARCHAR(5000)
);

INSERT INTO UserCountry
SELECT
  L.UserID AS UserID,
  CONCAT_WS(',', co...
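For reference, the standard Spark SQL replacement for SQL Server's FOR XML PATH('') string-concatenation trick is COLLECT_LIST plus CONCAT_WS, which the truncated snippet above appears to use. A self-contained sketch with hypothetical table and column names:

# SQL Server's FOR XML PATH('') comma-join pattern maps to an aggregate
# in Spark SQL: collect the values per group, then join them with a separator.
spark.sql("""
    SELECT L.UserID,
           CONCAT_WS(',', COLLECT_LIST(L.Country)) AS Countries
    FROM UserLocations L
    GROUP BY L.UserID
""").show()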
Hi community, I'm trying to read XML data from Azure Data Lake Gen2 using com.databricks:spark-xml_2.12:0.12.0:
spark.read.format('XML').load('abfss://[CONTAINER]@[storageaccount].dfs.core.windows.net/PATH/TO/FILE.xml')
The code above gives the followin...
The issue was also raised here: https://github.com/databricks/spark-xml/issues/591
A fix is to use the "spark.hadoop" prefix in front of the fs.azure Spark config keys:
spark.hadoop.fs.azure.account.oauth2.client.id.nubulosdpdlsdev01.dfs.core.windows.n...
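For context, the full set of OAuth configs with the "spark.hadoop." prefix looks roughly like the sketch below, so the keys reach the Hadoop FileSystem that spark-xml uses. The storage account name, client id, secret scope, and tenant id are placeholders:

# Service-principal auth for ADLS Gen2; note the "spark.hadoop." prefix on
# every key. These are typically set at the cluster level instead.
account = "mystorageaccount"
spark.conf.set(f"spark.hadoop.fs.azure.account.auth.type.{account}.dfs.core.windows.net",
               "OAuth")
spark.conf.set(f"spark.hadoop.fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"spark.hadoop.fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net",
               "<application-client-id>")
spark.conf.set(f"spark.hadoop.fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="my-scope", key="sp-secret"))
spark.conf.set(f"spark.hadoop.fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")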
File.json from the below code contains huge JSON data, with each key containing a namespace prefix (this JSON file was converted from an XML file). I am able to retrieve records if the JSON does not contain namespaces, but what could be the approach to retrieve record...
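One hedged approach to the namespace question: keys like "ns:customer" are valid Spark column names but must be backtick-quoted in selections, or the prefixes can be stripped once after loading. The file path and field names here are hypothetical:

# Backtick-quote each name part that contains a colon.
df = spark.read.json("dbfs:/FileStore/tables/file.json")
df.select("`ns:customer`.`ns:name`").show()

# Alternatively, strip the prefixes once so downstream code stays clean
# (this renames top-level columns only; nested keys keep their prefixes).
renamed = df.toDF(*[c.split(":")[-1] for c in df.columns])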
I have a set of XML files where the row tags change dynamically. How can we achieve this scenario in Databricks?
df1 = spark.read.format('xml').option('rootTag','XRoot').option('rowTag','PL1PLLL').load("dbfs:/FileStore/tables/ins/")
We need to pass a val...
If it is dynamic for the whole file, you can just use a variable:
tag = 'PL1PLLL'
df1 = spark.read.format('xml').option('rootTag', 'XRoot').option('rowTag', tag).load("dbfs:/FileStore/tables/ins/file.xml")
Hi all, I have a few questions about the Community Edition: 1) What is the max file size that is allowed to be uploaded (data file) in the Community Edition? 2) Is the XML file format supported as well? Regards, Rakesh
E.g. the dataframe has firstname, lastname, middlename, id, salary. I need to convert the dataframe to an XML file, but in nested format. Output as nested XML:
<Name> <firstname> <middlename> <lastname> </Name><id></id><salary></salary>
Anyone has ideas ho...
Databricks has an XML connector: https://docs.databricks.com/data/data-sources/xml.html
Basically you just define a df with the correct structure and write it to XML. To create a nested df, here you can find some info.
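As a sketch of that pattern, assuming df is the flat dataframe from the question and spark-xml is installed; the output path and tag names are placeholders:

from pyspark.sql import functions as F

# Group the flat name columns into a struct so spark-xml writes them
# as a nested <Name> element.
nested = df.select(
    F.struct("firstname", "middlename", "lastname").alias("Name"),
    "id",
    "salary",
)

(nested.write.format("xml")
    .option("rootTag", "Employees")
    .option("rowTag", "Employee")
    .save("dbfs:/tmp/employees_xml"))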
I have to write the extracted data from XML to a DB; I am using a dataframe for the transformation and trying to load it to the DB. I have installed these libraries: com.databricks:spark-xml_2.12:0.15.0 and com.microsoft.azure:spark-mssql-connector_2.11_2.4:1.0.2, and...
Hi @shafana Roohi Jahubar, hope all is well! Just wanted to check in if you were able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from ...
Hi team, we are trying to read multiple tiny XML files and are able to parse them using the Databricks XML jar, but is there any way to read these files in parallel and distribute the load across the cluster? Right now our job is taking 90% of the time rea...
Thank you @Hubert Dudek for the suggestion. Similar to your recommendation, we added a step in our pipeline to merge the small files into large files and make them available for the Spark job.
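An alternative to a separate merge step, sketched below with placeholder paths and element names: sc.wholeTextFiles packs many small files into each partition, so both the file reads and the parsing are distributed across the executors without pre-merging:

import xml.etree.ElementTree as ET
from pyspark.sql import Row

# Each record is a (path, full file content) pair; tune minPartitions
# to the cluster size.
raw = sc.wholeTextFiles("dbfs:/mnt/landing/xml/", minPartitions=64)

def parse(content):
    # Parse one small XML document on the executor (stdlib, no extra jar).
    root = ET.fromstring(content)
    return Row(id=root.findtext("id"), value=root.findtext("value"))

df = raw.map(lambda kv: parse(kv[1])).toDF()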
I have 8 GB of XML data loaded into different dataframes. Two of those dataframes, with 2.4 million and 8.2 million rows, have to be written to two SQL Server tables, which is taking 2 hours and 5 hours respectively. I am using the below cluster configura...
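For slow SQL Server writes like this, the usual levers are write parallelism and batch size. A hedged sketch using the Microsoft connector; the connection variables are hypothetical and the option values are starting points to tune, not verified settings for this workload:

# One concurrent bulk insert runs per partition, so repartition to a level
# the SQL Server instance can absorb.
(df.repartition(16)
   .write.format("com.microsoft.sqlserver.jdbc.spark")
   .mode("append")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.target_table")
   .option("user", user)
   .option("password", password)
   .option("batchsize", "100000")   # fewer round trips per partition
   .option("tableLock", "true")     # enables bulk-load optimizations
   .save())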
When writing unit tests with unittest/pytest in PySpark, reading mockup data sources with built-in formats like CSV and JSON (spark.read.format("json")) works just fine. But when reading XMLs with spark.read.format("com.databricks.spark.xml") in the ...
Please install spark-xml from Maven. As it comes from Maven, you need to install it on the cluster you are using, via the cluster settings (alternatively using the API or CLI): https://mvnrepository.com/artifact/com.databricks/spark-xml
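For local unit tests specifically, there is no Databricks cluster to carry the library, so the package can also be pulled from Maven when the test SparkSession is built. The version below is an example and must match the Scala build of the local PySpark installation; the rowTag and fixture path are placeholders:

from pyspark.sql import SparkSession

# spark.jars.packages downloads the coordinate from Maven at session start.
spark = (SparkSession.builder
         .master("local[2]")
         .appName("xml-unit-tests")
         .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.15.0")
         .getOrCreate())

df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load("tests/fixtures/sample.xml"))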
Hello, we have a large XML file (9 GB) that we can't read. We get this error: VM size limit. But how can we change the VM size limit? We have tested many clusters, but none of them can read this file. Thank you for your help.
Hi @Salah K., just a friendly follow-up. Did any of the responses help you to resolve your question? If they did, please mark one as best. Otherwise, please let us know if you still need help.
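On the 9 GB file above, one thing worth checking (a hedged sketch; the paths and tag are placeholders): spark-xml reads by rowTag with a streaming Hadoop input format, which can usually split a single large file across tasks as long as individual row elements are small. If single rows are themselves huge, a larger driver/worker VM type is the remaining lever.

# Reading by a fine-grained rowTag lets spark-xml stream and split the file
# instead of materializing the whole 9 GB document on one node.
df = (spark.read.format("xml")
      .option("rowTag", "record")
      .load("/mnt/raw/huge_file.xml"))

df.write.format("delta").mode("overwrite").save("/mnt/bronze/huge_records")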