
Unable to read an XML file of 9 GB

wyzer
Contributor II

Hello,

We have a large XML file (9 GB) that we can't read.

We get this error: VM size limit

How can we change the VM size limit?

We have tested many clusters, but none of them can read this file.

Thank you for your help.

9 REPLIES

RKNutalapati
Valued Contributor

Hi @Salah K., what is the cluster size and configuration? Please share your code snippet.

{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "cluster_name": "GrosCluster",
  "spark_version": "10.4.x-scala2.12",
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true"
  },
  "azure_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": -1
  },
  "node_type_id": "Standard_L8s",
  "driver_node_type_id": "Standard_L8s",
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 120,
  "enable_elastic_disk": true,
  "cluster_source": "UI",
  "init_scripts": [],
  "cluster_id": "0408-123105-xj70dm6w"
}

RKNutalapati
Valued Contributor

@Salah K., did you try a "Memory-optimized" cluster? My wild guess here is that the read runs as a single-threaded operation and that thread does not have enough memory. Make sure each thread in the cluster has more than 9 GB of memory.

Does your code have the inferSchema option enabled?

wyzer
Contributor II

@Rama Krishna N, yes, we tried the "Memory-optimized" cluster.

And no, we didn't change the thread settings in the cluster.

How do you do that, please?

RKNutalapati
Valued Contributor

Hi @Salah K., I am sorry for the confusion. I meant to say: use a bigger cluster and verify. Something like the one below.

Standard_M8ms
vCPU = 8
Memory (GiB) = 218

https://docs.microsoft.com/en-us/azure/virtual-machines/m-series
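To make the "more than 9 GB per thread" idea concrete, here is a rough sketch that divides a node's memory by its vCPU count. The Standard_M8ms figures come from this thread; the Standard_L8s figures (8 vCPU, 64 GiB) are an assumption taken from Azure's published sizes, and the object and method names are only illustrative. Spark does not actually hand out memory per thread (the executor heap is shared by the task slots, and part of the node's RAM is reserved), so treat the result as an optimistic upper bound.

```scala
// Rough per-core memory check; all names here are illustrative, not a Databricks API.
object MemoryPerCore {
  final case class NodeType(name: String, vCpus: Int, memoryGiB: Double)

  // Node memory divided evenly across vCPUs (an upper bound, ignoring reserved memory).
  def gibPerCore(node: NodeType): Double = node.memoryGiB / node.vCpus

  // True when the per-core share is strictly larger than what one task is assumed to need.
  def fits(node: NodeType, neededGiB: Double): Boolean = gibPerCore(node) > neededGiB
}
```

With these numbers, `MemoryPerCore.gibPerCore(MemoryPerCore.NodeType("Standard_L8s", 8, 64.0))` gives 8.0, which is below the 9 GB file size, while the Standard_M8ms node type (8 vCPU, 218 GiB) gives 27.25.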

I am not that familiar with Azure. I also do XML parsing, in an AWS workspace, but the files I load are not this large.

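// Read the XML file with the spark-xml data source; replace the placeholders with your own values.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "<MyRowTag>")   // the element that becomes one row
  .option("rootTag", "<MyRootTag>")
  .load("<XML File Path>")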

If the option below is set to true, turn it off:

.option("inferSchema", "false")
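A related idea, as a sketch only: pass an explicit schema to the reader so that spark-xml does not have to scan the file to work out the structure. The field names `id` and `name` and the `XmlWithSchema` object are hypothetical; replace them with the real layout of your row element. This assumes the spark-xml library is installed on the cluster, and I have not tried it on a 9 GB file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object XmlWithSchema {
  // Hypothetical schema: one string column per child element of the row tag.
  val schema: StructType = StructType(Seq(
    StructField("id", StringType, nullable = true),
    StructField("name", StringType, nullable = true)
  ))

  // Reads the file with the user-supplied schema instead of relying on inference.
  def read(spark: SparkSession, path: String, rowTag: String): DataFrame =
    spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", rowTag)
      .schema(schema)
      .load(path)
}
```

In a notebook this would be called as `XmlWithSchema.read(spark, "<XML File Path>", "<MyRowTag>")`.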

Kaniz_Fatma
Community Manager

Hi @Salah K., would you like to try @Rama Krishna N's suggestions?

Hi,

Yes, I want to try it, but I don't know how to change the memory per thread in the cluster.

Atanu
Esteemed Contributor

Hello @Salah K., you can try configuring spark.executor.memory in the cluster's Spark configuration.

total_executor_memory = (total_ram_per_node - 1) / executor_per_node
total_executor_memory = (64 - 1) / 3 = 21 (rounded down)

spark.executor.memory = total_executor_memory * 0.9
spark.executor.memory = 21 * 0.9 = 18 (rounded down)

memory_overhead = 21 * 0.1 = 3 (rounded up)
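The arithmetic above can be sketched in a few lines of Scala so you can plug in your own node size. The object, case class, and parameter names are made up for illustration, and the 1 GB subtraction and the 90/10 split are simply the rule of thumb quoted above, not an official Databricks formula.

```scala
// Rule-of-thumb executor sizing; names are illustrative, values are in GB.
object ExecutorSizing {
  final case class Plan(totalExecutorMemoryGb: Int, executorMemoryGb: Int, memoryOverheadGb: Int)

  def plan(totalRamPerNodeGb: Int, executorsPerNode: Int): Plan = {
    require(executorsPerNode > 0, "executorsPerNode must be positive")
    // Subtract 1 GB from the node's RAM, then split evenly (integer division rounds down).
    val total = (totalRamPerNodeGb - 1) / executorsPerNode
    // 90% of the share goes to spark.executor.memory, rounded down.
    val executorMemory = total * 9 / 10
    // 10% of the share is kept as overhead, rounded up.
    val overhead = (total + 9) / 10
    Plan(total, executorMemory, overhead)
  }
}
```

For the example above, `ExecutorSizing.plan(64, 3)` returns `Plan(21, 18, 3)`, which would correspond to a line such as `spark.executor.memory 18g` in the cluster's Spark config.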

jose_gonzalez
Moderator

Hi @Salah K.​,

Just a friendly follow-up. Did any of the responses help you resolve your question? If so, please mark it as best. Otherwise, please let us know if you still need help.
