04-12-2022 05:12 AM
Hello,
We have a large XML file (9 GB) that we can't read.
We get this error: "VM size limit".
How can we change the VM size limit?
We have tried many clusters, but none of them can read this file.
Thank you for your help.
04-12-2022 09:12 PM
Hi @Salah K.: What is the cluster size/configuration? Please share your code snippet.
04-13-2022 12:23 AM
{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "cluster_name": "GrosCluster",
  "spark_version": "10.4.x-scala2.12",
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true"
  },
  "azure_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": -1
  },
  "node_type_id": "Standard_L8s",
  "driver_node_type_id": "Standard_L8s",
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 120,
  "enable_elastic_disk": true,
  "cluster_source": "UI",
  "init_scripts": [],
  "cluster_id": "0408-123105-xj70dm6w"
}
04-13-2022 01:15 AM
@Salah K., Did you try with a "Memory-optimized" cluster? My wild guess here is that it is doing a single-threaded operation and that thread does not have enough memory. Ensure each thread in the cluster has more than 9 GB of memory.
Does your code have the inferSchema option enabled?
04-13-2022 01:52 AM
@Rama Krishna N, Yes, we tried the "Memory-optimized" cluster.
And no, we didn't change the thread memory in the cluster.
How do we do that, please?
04-13-2022 02:54 AM
Hi @Salah K. - I am sorry for the confusion. I meant: use a bigger cluster and verify. Something like the below:
Standard_M8ms
vCPU = 8
Memory = 218 GiB
https://docs.microsoft.com/en-us/azure/virtual-machines/m-series
I am not that familiar with Azure. I am also doing XML parsing in an AWS workspace, but the files I am loading are not this huge.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "<MyRowTag>")
.option("rootTag", "<MyRootTag>")
.load("<XML File Path>")
Turn off the option below if it is enabled; schema inference forces an extra full pass over the 9 GB file:
.option("inferSchema", "false")
04-13-2022 02:00 AM
Hi @Salah K. , Would you like to try @Rama Krishna N 's suggestions?
04-13-2022 02:03 AM
Hi,
Yes, I want to try it, but I don't know how to change the thread memory in the cluster.
05-21-2022 06:48 AM
Hello @Salah K., you can try configuring spark.executor.memory in the cluster's Spark configuration. For example, with 64 GiB of RAM per node and 3 executors per node:
total_executor_memory = (total_ram_per_node - 1) / executors_per_node
total_executor_memory = (64 - 1) / 3 = 21 (rounded down)
spark.executor.memory = total_executor_memory * 0.9
spark.executor.memory = 21 * 0.9 = 18 (rounded down)
memory_overhead = 21 * 0.1 = 3 (rounded up)
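The arithmetic above can be sketched as a few lines of Python (the 64 GiB node RAM and 3 executors per node are the assumed values from the example, not properties of the cluster in this thread):

```python
import math

total_ram_per_node_gib = 64   # assumed RAM per worker node
executors_per_node = 3        # assumed executor count per node

# Reserve 1 GiB for the OS and daemons, then split the rest evenly.
per_executor_gib = math.floor((total_ram_per_node_gib - 1) / executors_per_node)

# Roughly 90% goes to the executor heap, 10% to memory overhead.
executor_memory_gib = math.floor(per_executor_gib * 0.9)  # heap for spark.executor.memory
memory_overhead_gib = math.ceil(per_executor_gib * 0.1)   # off-heap overhead

print(per_executor_gib, executor_memory_gib, memory_overhead_gib)
```

With these numbers you would set "spark.executor.memory": "18g" in the cluster's spark_conf.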
07-25-2022 02:14 PM
Hi @Salah K.,
Just a friendly follow-up. Did any of the responses help you resolve your question? If so, please mark it as best. Otherwise, please let us know if you still need help.