06-08-2023 10:34 PM
I am trying to get all unique IDs from the SQL query below, and I always run out of memory.
select concat_ws(';', view.MATNR, view.WERKS)
from hive_metastore.dqaas.temp_view as view
join hive_metastore.dqaas.t_dqaas_marc as marc
  on marc.MATNR = view.MATNR
where view.WERKS NOT IN ('BR91', 'BR92', 'BR94', 'BR97', 'BR98', 'BR9A', 'BR9B', 'BR9C', 'BR9D', 'BR9L', 'BR9X', 'CN9S', 'XM93', 'ZA90', 'ZA93')
  and marc.HERKL = view.HERKL
  and marc.LVORM = ' '
I run it with the following code:
var df = spark.sql("""
  select concat_ws(';', view.MATNR, view.WERKS)
  from hive_metastore.dqaas.temp_view as view
  join hive_metastore.dqaas.t_dqaas_marc as marc
    on marc.MATNR = view.MATNR
  where view.WERKS NOT IN ('BR91', 'BR92', 'BR94', 'BR97', 'BR98', 'BR9A', 'BR9B', 'BR9C', 'BR9D', 'BR9L', 'BR9X', 'CN9S', 'XM93', 'ZA90', 'ZA93')
    and marc.HERKL = view.HERKL
    and marc.LVORM = ' '
""")
var distinctValue: Set[String] = df.rdd.mapPartitions { data =>
  val uniqueIdSet = data.map(row => row.getAs[String](0)).toSet
  Iterator(uniqueIdSet)
}.collect.flatten.toSet
Rows in temp_view: 5,000
Rows in t_dqaas_marc: 22,354,457
The query output gives me more than 400,000 (4 lakh) records.
The exception I am getting:
Job aborted due to stage failure: Photon ran out of memory while executing this query.
Photon failed to reserve 512.0 MiB for hash table buckets, in SparseHashedRelation, in BuildHashedRelation, in BroadcastHashedRelation(spark_plan_id=59815).
Memory usage:
BroadcastHashedRelation(spark_plan_id=59815): allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
BuildHashedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
PartitionedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
partition 0: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
rows: allocated 890.0 MiB, tracked 890.0 MiB, untracked allocated 0.0 B, peak 890.0 MiB
var-len data: allocated 420.0 MiB, tracked 420.0 MiB, untracked allocated 0.0 B, peak 420.0 MiB
SparseHashedRelation: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
hash table var-len key data: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
hash table payloads: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
hash table buckets: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
Caused by: SparkOutOfMemoryError: Photon ran out of memory while executing this query.
Photon failed to reserve 512.0 MiB for hash table buckets, in SparseHashedRelation, in BuildHashedRelation, in BroadcastHashedRelation(spark_plan_id=59815).
My cluster configuration:
- Workers: 2-16 (32-256 GB memory, 8-64 cores)
- Driver: 1 (16 GB memory, 4 cores)
- Runtime: 11.3.x-scala2.12 (Photon)
- Node type: Standard_D4as_v5
- 6-34 DBU/h
I have also tried saving the output to another table, and I ran into the same issue:
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(tempTableName)
No matter what operation I perform on the DataFrame from the above SQL query, I always end up with an out-of-memory exception.
Accepted Solutions
06-09-2023 02:59 AM
That collect statement moves all of the data to the driver, so you lose all parallelism and the driver has to do all the processing. If you beef up your driver, it might work.
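To illustrate that point, here is a minimal sketch of keeping the deduplication on the executors instead of in mapPartitions + collect, assuming df is the DataFrame built from the query above and that "dqaas_unique_ids" is a placeholder table name you would adjust to your environment:
import spark.implicits._

// Let Spark deduplicate in parallel on the executors; only the distinct
// strings (roughly 400k values) are then collected to the driver.
val distinctValues: Set[String] =
  df.distinct().as[String].collect().toSet

// Alternatively, avoid the driver altogether and persist the distinct
// values to a Delta table ("dqaas_unique_ids" is a placeholder name).
df.distinct()
  .write
  .format("delta")
  .mode("overwrite")
  .saveAsTable("hive_metastore.dqaas.dqaas_unique_ids")
Both variants keep the join and the deduplication distributed across the workers.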
06-13-2023 11:15 PM
Hi @Anil Kumar Chauhan,
We haven't heard from you since the last response from @Werner Stinckens. Kindly share an update with us, and we will provide you with the necessary solution.
Thanks and Regards

