Data Engineering

Unable to read an XML file of 9 GB

wyzer
Contributor II

Hello,

We have a large XML file (9 GB) that we can't read.

We get this error: VM size limit

How can we change the VM size limit?

We have tested many clusters, but none of them can read this file.

Thank you for your help.

9 REPLIES

RKNutalapati
Valued Contributor

Hi @Salah K.: What is the cluster size / configuration? Please share your code snippet.

{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "cluster_name": "GrosCluster",
  "spark_version": "10.4.x-scala2.12",
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true"
  },
  "azure_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": -1
  },
  "node_type_id": "Standard_L8s",
  "driver_node_type_id": "Standard_L8s",
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 120,
  "enable_elastic_disk": true,
  "cluster_source": "UI",
  "init_scripts": [],
  "cluster_id": "0408-123105-xj70dm6w"
}

RKNutalapati
Valued Contributor

@Salah K., did you try a "Memory-optimized" cluster? My wild guess is that the read is a single-threaded operation and that thread does not have enough memory. Ensure each thread in the cluster has more than 9 GB of memory.

Does your code have the inferSchema option enabled?

wyzer
Contributor II

@Rama Krishna N, yes, we tried the "Memory-optimized" cluster.

And no, we didn't change the memory per thread in the cluster.

How do we do that, please?

RKNutalapati
Valued Contributor

Hi @Salah K. - I am sorry for the confusion. I meant: use a bigger cluster and verify. Something like the one below.

Standard_M8ms: 8 vCPUs, 218 GiB of memory

https://docs.microsoft.com/en-us/azure/virtual-machines/m-series

I am not that familiar with Azure. I also do XML parsing in an AWS workspace, but the files I am loading are not this huge.

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "<MyRowTag>")
  .option("rootTag", "<MyRootTag>")
  .load("<XML File Path>")

Turn off the option below if it is currently set to true:

  .option("inferSchema", "false")

Kaniz
Community Manager

Hi @Salah K., would you like to try @Rama Krishna N's suggestions?

wyzer
Contributor II

Hi,

Yes, I want to try it, but I don't know how to change the memory per thread in the cluster.

Atanu
Esteemed Contributor

Hello @Salah K., you can try configuring spark.executor.memory in the cluster's Spark configuration.

total_executor_memory = (total_ram_per_node - 1) / executors_per_node
total_executor_memory = (64 - 1) / 3 = 21 (rounded down)

spark.executor.memory = total_executor_memory * 0.9
spark.executor.memory = 21 * 0.9 = 18 (rounded down)

memory_overhead = 21 * 0.1 = 3 (rounded up)
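On Databricks, those values go in the cluster's Spark config box (Edit cluster > Advanced Options > Spark). Using the worked numbers above (a hypothetical 64 GB node with 3 executors per node, not necessarily your Standard_L8s), the entries would look like this sketch:

spark.executor.memory 18g
spark.executor.memoryOverhead 3g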

jose_gonzalez
Moderator
Moderator

Hi @Salah K.​,

Just a friendly follow-up. Did any of the responses help you resolve your question? If so, please mark it as best. Otherwise, please let us know if you still need help.