Unable to read an XML file of 9 GB
04-12-2022 05:12 AM
Hello,
We have a large XML file (9 GB) that we can't read.
We get this error: VM size limit.
But how can we change the VM size limit?
We have tested many clusters, but none of them can read this file.
Thank you for your help.
- Labels: Large XML File, VM Size Limit, Xml, XML File
04-12-2022 09:12 PM
Hi @Salah K.: What is the cluster size / configuration? Please share your code snippet.
04-13-2022 12:23 AM
{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "cluster_name": "GrosCluster",
  "spark_version": "10.4.x-scala2.12",
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true"
  },
  "azure_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": -1
  },
  "node_type_id": "Standard_L8s",
  "driver_node_type_id": "Standard_L8s",
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 120,
  "enable_elastic_disk": true,
  "cluster_source": "UI",
  "init_scripts": [],
  "cluster_id": "0408-123105-xj70dm6w"
}
04-13-2022 01:15 AM
@Salah K., did you try with a "Memory-optimized" cluster? My wild guess here is that it is doing a single-threaded operation and that thread does not have enough memory. Ensure each thread in the cluster has more than 9 GB of memory.
Does your code have the inferSchema option enabled?
04-13-2022 01:52 AM
@Rama Krishna N, yes, we tried the "Memory-optimized" cluster.
And no, we didn't change the thread memory in the cluster.
How do you do that, please?
04-13-2022 02:54 AM
Hi @Salah K. - I am sorry for the confusion. I meant to say: use a bigger cluster and verify. Something like the below.
Standard_M8ms: vCPU = 8, Memory = 218 GiB
https://docs.microsoft.com/en-us/azure/virtual-machines/m-series
I am not that familiar with Azure. I am also doing XML parsing in an AWS workspace, but the files I am loading are not this huge.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "<MyRowTag>")
  .option("rootTag", "<MyRootTag>")
  .load("<XML File Path>")
Turn off the below option if it is true:
  .option("inferSchema", "false")
04-13-2022 02:03 AM
Hi,
Yes, I want to try it, but I don't know how to change the thread memory in the cluster.
05-21-2022 06:48 AM
Hello @Salah K., you can try configuring spark.executor.memory from the cluster Spark configuration. For example, with 64 GB of RAM per node and 3 executors per node:
total_executor_memory = (total_ram_per_node - 1) / executor_per_node = (64 - 1) / 3 = 21 (rounded down)
spark.executor.memory = total_executor_memory * 0.9 = 21 * 0.9 = 18 (rounded down)
memory_overhead = total_executor_memory * 0.1 = 21 * 0.1 = 3 (rounded up)
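To apply this, the values go into the cluster's Spark config (cluster edit page, Advanced options, Spark tab). Using the example numbers above (64 GB of RAM per node and 3 executors per node - both assumptions, not values taken from the cluster JSON earlier in this thread), the entries would look roughly like:

spark.executor.memory 18g
spark.executor.memoryOverhead 3g

Note that Databricks normally sizes executor memory automatically from the node type, so treat this as a manual override to experiment with rather than a definitive setting.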
07-25-2022 02:14 PM
Hi @Salah K.,
Just a friendly follow-up. Did any of the responses help you resolve your question? If it did, please mark it as best. Otherwise, please let us know if you still need help.

