12-06-2024 08:09 AM
Thank you for your question. The error is likely caused by memory issues or inefficient processing of the large dataset. Parsing XML with XPath is resource-intensive, and handling 1 million records requires optimization.
You can try df = df.repartition(100), raising spark.task.cpus from 1 to 2 (so fewer tasks run at once on each executor and each gets a larger share of its memory), or increasing the executor size. This will at least show you how much resource is truly required and whether the data is fully and evenly parallelised, so you can tune it further later on.
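As a rough sketch of how those numbers relate (not a definitive recipe): the helper names and the target of 10,000 records per partition below are assumptions chosen for illustration, not values from your job. The only Spark call used is `df.repartition(n)`.

```python
import math


def suggest_partition_count(total_records, target_records_per_partition=10_000):
    """Hypothetical helper: partitions needed so each holds roughly the target row count."""
    return max(1, math.ceil(total_records / target_records_per_partition))


def concurrent_tasks_per_executor(executor_cores, task_cpus):
    """Tasks that run at once on one executor: executor cores divided by spark.task.cpus."""
    return executor_cores // task_cpus


def repartition_evenly(df, total_records, target_records_per_partition=10_000):
    """Repartition a Spark DataFrame using the suggested partition count."""
    n = suggest_partition_count(total_records, target_records_per_partition)
    return df.repartition(n)


# 1 million records at an assumed 10,000 per partition gives 100 partitions.
num_partitions = suggest_partition_count(1_000_000)

# With an assumed 4-core executor, spark.task.cpus=2 halves the concurrent tasks,
# so each running task has about twice the memory available to it.
tasks_default = concurrent_tasks_per_executor(4, 1)
tasks_tuned = concurrent_tasks_per_executor(4, 2)
```

On a real DataFrame this would be called as df = repartition_evenly(df, df.count()); note that df.count() triggers a job of its own. spark.task.cpus is normally set in the cluster's Spark config before the cluster starts rather than changed mid-session. To check whether the rows are spread evenly, grouping by pyspark.sql.functions.spark_partition_id() and counting should show the row count per partition.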