Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark Out of Memory Error

leungi
Contributor

Background

Using the R {sparklyr} package to fetch data from tables in Unity Catalog, I hit the error below.
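The fetch is essentially the following - a minimal sketch, with catalog/schema/table names anonymized:

    library(sparklyr)
    library(dplyr)

    # Connect from a Databricks notebook.
    sc <- spark_connect(method = "databricks")

    # Reference the Unity Catalog table and pull it into the R session;
    # collect() is what triggers the row decode on the driver.
    df <- sdf_sql(sc, "SELECT * FROM some_catalog.some_schema.some_table") %>%
      collect()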

Tried the following, to no avail:

  • Using a memory-optimized cluster - e.g., E4d.
  • Using a bigger (more RAM) cluster - e.g., E8d.
  • Enabling auto-scaling.
  • Setting Spark config (expressed via sparklyr after this list):
    • spark.driver.maxResultSize 4096
    • spark.memory.offHeap.enabled true
    • spark.driver.memory 8082
    • spark.executor.instances 4
    • spark.memory.offHeap.size 7284
    • spark.executor.memory 7284
    • spark.executor.cores 4
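For context, these settings can be expressed through sparklyr's spark_config() roughly as below - a sketch only; the unit suffixes are an assumption, since the bare values above don't specify units, and on Databricks such settings are typically applied in the cluster's Spark config UI because driver memory cannot change after the JVM starts.

    library(sparklyr)

    conf <- spark_config()
    conf$spark.driver.maxResultSize   <- "4096m"   # assuming MB
    conf$spark.memory.offHeap.enabled <- "true"
    conf$spark.memory.offHeap.size    <- "7284m"   # assuming MB
    conf$spark.driver.memory          <- "8082m"   # assuming MB
    conf$spark.executor.memory        <- "7284m"   # assuming MB
    conf$spark.executor.cores         <- 4
    conf$spark.executor.instances     <- 4

    sc <- spark_connect(method = "databricks", config = conf)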

Error

Error: ! org.apache.spark.memory.SparkOutOfMemoryError: Total memory usage during row decode exceeds spark.driver.maxResultSize (4.0 GiB). The average row size was 48.0 B, with 2.9 GiB used for temporary buffers. Run `sparklyr::spark_last_error()` to see the full Spark error (multiple lines). To use the previous style of error message set `options("sparklyr.simple.errors" = TRUE)`.

Kaniz_Fatma
Community Manager

Hi @leungi,

  1. Since the error indicates that the total memory usage during row decode exceeds spark.driver.maxResultSize, you might try increasing this value beyond 4.0 GiB.
  2. Repartition your data to increase the number of partitions. This can help distribute the data more evenly across the cluster and reduce the memory load on individual executors (a sparklyr sketch follows this list).
  3. Ensure that the memory configurations are set appropriately. For example:
    • spark.driver.memory and spark.executor.memory should be set to values that your cluster can handle.
    • spark.memory.offHeap.size should be adjusted based on your off-heap memory requirements.
  4. Try to optimize your data processing logic to reduce memory consumption. This might include filtering out unnecessary data early in the processing pipeline or using more memory-efficient operations.
  5. If you have large lookup tables or static data, consider using broadcast variables to distribute this data efficiently across the cluster.
  6. Use Spark's monitoring tools to identify which stages of your job are consuming the most memory. This can help you pinpoint specific areas to optimize.
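A rough sparklyr sketch of points 1, 2, and 4 - the result-size value, partition count, and column name below are illustrative assumptions, not prescriptions:

    library(sparklyr)
    library(dplyr)

    # 1. Raise the result-size cap before connecting (illustrative value).
    conf <- spark_config()
    conf$spark.driver.maxResultSize <- "8g"
    sc <- spark_connect(method = "databricks", config = conf)

    tbl_ref <- sdf_sql(sc, "SELECT * FROM some_catalog.some_schema.some_table")

    result <- tbl_ref %>%
      filter(!is.na(some_column)) %>%         # 4. drop unneeded rows early
      sdf_repartition(partitions = 200) %>%   # 2. spread rows over more partitions
      collect()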

If these suggestions don't resolve the issue, you might need to provide more details about your data and processing logic for a more tailored solution. Let me know if you need further assistance!

leungi
Contributor

@Kaniz_Fatma, thanks for the detailed suggestions.

I believe the first suggestion relates to the issue; however, after adjusting spark.driver.maxResultSize to various values - e.g., 10g, 20g, 30g - a new error ensues (see below).

The operation involves a collect() on a Delta table with 380 MM rows and 5 columns (3.2 GB, partitioned into 55 files). If the average row size is 48 bytes (per the initial error), shouldn't 20 GB be sufficient?
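A quick back-of-the-envelope check of that estimate, counting raw row bytes only and ignoring the temporary decode buffers the error mentions:

    rows          <- 380e6   # 380 MM rows
    bytes_per_row <- 48      # from the initial error message
    rows * bytes_per_row / 1024^3   # ~17 GiB of raw row data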

New Error

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
at com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:1367)

 
