I am trying to get all the unique IDs from a SQL query, but I always run out of memory.
select concat_ws(';', view.MATNR, view.WERKS)
from hive_metastore.dqaas.temp_view as view
join hive_metastore.dqaas.t_dqaas_marc as marc
  on marc.MATNR = view.MATNR
where view.WERKS NOT IN ('BR91', 'BR92', 'BR94', 'BR97', 'BR98', 'BR9A', 'BR9B', 'BR9C', 'BR9D', 'BR9L', 'BR9X', 'CN9S', 'XM93', 'ZA90', 'ZA93')
  and marc.HERKL = view.HERKL
  and marc.LVORM = ' '
with the following code:
// Run the join and produce a single concatenated string column per row.
val df = spark.sql("select concat_ws(';',view.MATNR,view.WERKS) from hive_metastore.dqaas.temp_view as view join hive_metastore.dqaas.t_dqaas_marc as marc on marc.MATNR = view.MATNR where view.WERKS NOT IN ('BR91', 'BR92', 'BR94', 'BR97', 'BR98', 'BR9A', 'BR9B', 'BR9C', 'BR9D', 'BR9L', 'BR9X','CN9S', 'XM93', 'ZA90', 'ZA93') and marc.HERKL = view.HERKL and marc.LVORM = ' '")

// Build a Set of unique IDs per partition, then merge the partition sets on the driver.
val distinctValue: Set[String] = df.rdd.mapPartitions { data =>
  val uniqueIdSet = data.map(row => row.getAs[String](0)).toSet
  Iterator(uniqueIdSet)
}.collect().flatten.toSet
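For context, all I ultimately need is the set of distinct values of that single concatenated column. An equivalent way to express the same thing with the Dataset API would be something like this (just a sketch; it assumes the query returns exactly one string column):

import spark.implicits._

// Deduplicate on the executors first, then collect only the distinct values to the driver.
val distinctValue: Set[String] = df
  .distinct()
  .as[String]
  .collect()
  .toSet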
temp_view has 5,000 rows, t_dqaas_marc has 22,354,457 rows, and the query returns a little over 400,000 records.
The exception I am getting:
Job aborted due to stage failure: Photon ran out of memory while executing this query.
Photon failed to reserve 512.0 MiB for hash table buckets, in SparseHashedRelation, in BuildHashedRelation, in BroadcastHashedRelation(spark_plan_id=59815).
Memory usage:
BroadcastHashedRelation(spark_plan_id=59815): allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
BuildHashedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
PartitionedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
partition 0: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
rows: allocated 890.0 MiB, tracked 890.0 MiB, untracked allocated 0.0 B, peak 890.0 MiB
var-len data: allocated 420.0 MiB, tracked 420.0 MiB, untracked allocated 0.0 B, peak 420.0 MiB
SparseHashedRelation: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
hash table var-len key data: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
hash table payloads: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
hash table buckets: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
Caused by: SparkOutOfMemoryError: Photon ran out of memory while executing this query.
Photon failed to reserve 512.0 MiB for hash table buckets, in SparseHashedRelation, in BuildHashedRelation, in BroadcastHashedRelation(spark_plan_id=59815).
My cluster configuration:
Workers: 2-16 (32-256 GB memory, 8-64 cores)
Driver: 1 (16 GB memory, 4 cores)
Runtime: 11.3.x-scala2.12 (Photon)
Node type: Standard_D4as_v5
6-34 DBU/h
I have also tried saving the output to another table, and I ran into the same issue:
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(tempTableName)
No matter what operation I perform on the DataFrame from this query, I always end up with an out-of-memory exception.
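From the stack trace, the failure seems to happen while Photon builds the broadcast side of the hash join (the BroadcastHashedRelation in the error). Would checking the plan and disabling automatic broadcast joins be the right direction? For example (standard Spark settings, just a sketch of what I mean; I have not confirmed this avoids the error):

// Show which join strategy the optimizer picked for the marc join.
df.explain("formatted")

// Fall back from a broadcast hash join to a sort-merge join by disabling auto-broadcast,
// then re-create the DataFrame and re-run the action.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")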