Performance Issue with XML Processing in Spark Databricks
04-19-2024 04:09 AM
I am reaching out to bring attention to a performance issue we are encountering while processing XML files using Spark-XML, particularly with the configuration spark.read().format("com.databricks.spark.xml").
Currently, processing 12,000 files takes approximately 35 minutes in our AWS EMR environment. The EMR cluster is configured with 200 vCPUs, 500 GB of memory, and 900 GB of disk space, and we are using Hudi upsert mode with 30 executors of 16.6 GB memory each. Despite these resources, utilization is suboptimal: as the attached screenshot shows, each executor is only processing 4-5 MB.
Upon further investigation, we noticed that the reading phase, during which Spark reads the files or plans to read them, takes approximately 8-10 minutes for the 12,000 files. However, the writing phase, following the processing, takes significantly longer, ranging from 25-30 minutes. We identified that the XmlRecordReader initialize method (referenced here: https://github.com/databricks/spark-xml/blob/v0.14.0/src/main/scala/com/databricks/spark/xml/XmlInpu...) is repeatedly called during the reading phase, which appears to introduce delays in processing paths from S3.
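For reference, the read/write path in question looks roughly like this (a minimal sketch; the rowTag value, S3 prefixes, and Hudi options below are placeholders rather than our exact configuration):
// Sketch of the batch pipeline described above; rowTag, paths,
// and Hudi options are placeholders.
Dataset<Row> xml = spark.read()
        .format("com.databricks.spark.xml")
        .option("rowTag", "record")              // assumed row tag
        .load("s3://our-bucket/xml-input/");     // ~12,000 small XML files

xml.write()
        .format("hudi")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.table.name", "xml_table") // placeholder table name
        // (record key and precombine field options omitted in this sketch)
        .mode("append")
        .save("s3://our-bucket/hudi-output/");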
Please let me know if any further information is required.
04-19-2024 09:51 AM
@amar1995 - Can you try this streaming approach and see if it works for your use case (using autoloader) - https://kb.databricks.com/streaming/stream-xml-auto-loader
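The general pattern from that article is to ingest the raw files with Auto Loader as binaryFile and then parse each micro-batch with spark-xml. A rough Java sketch (Databricks Runtime only, since cloudFiles is proprietary; the row tag, paths, and Hudi sink are assumptions, and the exact spark-xml entry point may differ by version):
// Ingest whole XML files as bytes with Auto Loader (Databricks Runtime only).
Dataset<Row> raw = spark.readStream()
        .format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load("s3://our-bucket/xml-input/");           // placeholder input path

raw.writeStream()
        .option("checkpointLocation", "s3://our-bucket/checkpoints/") // placeholder
        .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batch, batchId) -> {
            // "content" holds each file's bytes; this assumes one record per file.
            Dataset<String> xmlStrings = batch
                    .select(batch.col("content").cast("string"))
                    .as(Encoders.STRING());
            // Parse with spark-xml's XmlReader (entry point may differ by version).
            Dataset<Row> parsed = new XmlReader()
                    .withRowTag("record")              // assumed row tag
                    .xmlRdd(spark.sqlContext(), xmlStrings.rdd());
            parsed.write().format("hudi").mode("append")
                    .save("s3://our-bucket/hudi-output/"); // placeholder sink
        })
        .start();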
04-22-2024 01:06 AM
I'm facing an issue integrating the cloudFiles data source in my Spark application, following the approach you mentioned above.
Dataset<Row> sparkReader = spark.readStream()
        .format("cloudFiles")                           // Auto Loader source
        .option("cloudFiles.useNotifications", "false") // directory listing mode
        .option("cloudFiles.format", "binaryFile")      // ingest files as raw bytes
        .load("paths");
I've also included the spark-xml dependency in my pom.xml and pass the spark-xml jar on spark-submit:
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.12</artifactId>
    <version>0.18.0</version>
</dependency>
Despite all this, I'm encountering the following error:
org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: cloudFiles. Please find packages at `https://spark.apache.org/third-party-projects.html`.
It seems the cloudFiles data source is not being recognized. We are also using scattered file paths; for example, our output files are located in the following structured paths:
files\data\20\
files\data\21\
files\data\22\
Example files: files\data\xml\20\23\testxml.xml, files\data\20\24\testxml2.xml
Any insights into what might be causing this issue and how I can resolve it would be greatly appreciated.
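(As a side note, a scattered layout like this can usually be covered with a glob pattern in a single load path instead of enumerating files; a sketch with a placeholder prefix and an assumed row tag:)
// Glob over the two-level directory layout in one load call
// (placeholder prefix; the same pattern works in load() for file streams).
Dataset<Row> df = spark.read()
        .format("com.databricks.spark.xml")
        .option("rowTag", "record")          // assumed row tag
        .load("s3://our-bucket/files/data/*/*/");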
04-22-2024 02:01 PM
@amar1995 - Can you please try this within the Databricks Runtime? The cloudFiles (Auto Loader) source is part of the Databricks Runtime and is not available in open-source Spark, which is why it cannot be found on EMR.
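If you need to stay on EMR, the closest open-source equivalent is Spark's built-in file stream source; a sketch (the path is a placeholder, and binaryFile support in readStream should be verified for your Spark version):
// Open-source alternative to Auto Loader on EMR: the built-in file stream source.
// binaryFile streaming support here is an assumption to verify on Spark 3.x.
Dataset<Row> raw = spark.readStream()
        .format("binaryFile")
        .load("s3://our-bucket/files/data/*/*/"); // placeholder glob path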
04-22-2024 11:07 PM
@shan_chandra - As mentioned above, we are using an AWS EMR 6.15 environment.