06-08-2023 10:34 PM
I am trying to get all unique IDs from the SQL query below, and I always run out of memory.
select concat_ws(';', view.MATNR, view.WERKS)
from hive_metastore.dqaas.temp_view as view
join hive_metastore.dqaas.t_dqaas_marc as marc
  on marc.MATNR = view.MATNR
where view.WERKS NOT IN ('BR91', 'BR92', 'BR94', 'BR97', 'BR98', 'BR9A', 'BR9B', 'BR9C', 'BR9D', 'BR9L', 'BR9X', 'CN9S', 'XM93', 'ZA90', 'ZA93')
  and marc.HERKL = view.HERKL
  and marc.LVORM = ' '
I run it with the following code:
var df = spark.sql("""
  select concat_ws(';', view.MATNR, view.WERKS)
  from hive_metastore.dqaas.temp_view as view
  join hive_metastore.dqaas.t_dqaas_marc as marc
    on marc.MATNR = view.MATNR
  where view.WERKS NOT IN ('BR91', 'BR92', 'BR94', 'BR97', 'BR98', 'BR9A', 'BR9B', 'BR9C', 'BR9D', 'BR9L', 'BR9X', 'CN9S', 'XM93', 'ZA90', 'ZA93')
    and marc.HERKL = view.HERKL
    and marc.LVORM = ' '
""")
var distinctValue: Set[String] = df.rdd.mapPartitions { data =>
  val uniqueIdSet = data.map(row => row.getAs[String](0)).toSet
  Iterator(uniqueIdSet)
}.collect.flatten.toSet
Rows in temp_view: 5,000
Rows in t_dqaas_marc: 22,354,457
The query output gives me more than 400,000 (4 lakh) records.
The exception I am getting:
Job aborted due to stage failure: Photon ran out of memory while executing this query.
Photon failed to reserve 512.0 MiB for hash table buckets, in SparseHashedRelation, in BuildHashedRelation, in BroadcastHashedRelation(spark_plan_id=59815).
Memory usage:
BroadcastHashedRelation(spark_plan_id=59815): allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
BuildHashedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
PartitionedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
partition 0: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
rows: allocated 890.0 MiB, tracked 890.0 MiB, untracked allocated 0.0 B, peak 890.0 MiB
var-len data: allocated 420.0 MiB, tracked 420.0 MiB, untracked allocated 0.0 B, peak 420.0 MiB
SparseHashedRelation: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
hash table var-len key data: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
hash table payloads: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
hash table buckets: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
Caused by: SparkOutOfMemoryError: Photon ran out of memory while executing this query.
Photon failed to reserve 512.0 MiB for hash table buckets, in SparseHashedRelation, in BuildHashedRelation, in BroadcastHashedRelation(spark_plan_id=59815).
My cluster configuration:
- Workers: 2-16 (32-256 GB memory, 8-64 cores)
- Driver: 1 (16 GB memory, 4 cores)
- Runtime: 11.3.x-scala2.12 (Photon)
- Node type: Standard_D4as_v5
- 6-34 DBU/h
I have also tried saving the output to another table, and I ran into the same issue:
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(tempTableName)
No matter what operation I perform on the DataFrame from the above SQL query, I always end up with an out-of-memory exception.
Accepted Solutions
06-09-2023 02:59 AM
That collect statement moves all of the data to the driver, so you lose all parallelism and the driver has to do all the processing. If you beef up your driver, it might work.
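To illustrate that point, here is a minimal sketch of keeping the deduplication on the executors instead of in mapPartitions + collect, assuming df is the DataFrame built from the query above and that "dqaas_unique_ids" is a placeholder table name you would adjust to your environment:
import spark.implicits._

// Let Spark deduplicate in parallel on the executors; only the distinct
// strings (roughly 400k values) are then collected to the driver.
val distinctValues: Set[String] =
  df.distinct().as[String].collect().toSet

// Alternatively, avoid the driver altogether and persist the distinct
// values to a Delta table ("dqaas_unique_ids" is a placeholder name).
df.distinct()
  .write
  .format("delta")
  .mode("overwrite")
  .saveAsTable("hive_metastore.dqaas.dqaas_unique_ids")
Both variants keep the join and the deduplication distributed across the workers.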
06-13-2023 11:15 PM
Hi @Anil Kumar Chauhan,
We haven't heard from you since the last response from @Werner Stinckens. Kindly share an update with us, and we will provide you with the necessary solution.
Thanks and Regards

