
"Photon ran out of memory" while when trying to get the unique Id from sql query

rusty
New Contributor II

I am trying to get all unique IDs from a SQL query, and I always run out of memory.

select concat_ws(';', view.MATNR, view.WERKS)
from hive_metastore.dqaas.temp_view as view
join hive_metastore.dqaas.t_dqaas_marc as marc
  on marc.MATNR = view.MATNR
where view.WERKS NOT IN ('BR91', 'BR92', 'BR94', 'BR97', 'BR98', 'BR9A', 'BR9B', 'BR9C', 'BR9D', 'BR9L', 'BR9X', 'CN9S', 'XM93', 'ZA90', 'ZA93')
  and marc.HERKL = view.HERKL
  and marc.LVORM = ' '

with the following code:

// Run the query above.
val df = spark.sql("select concat_ws(';',view.MATNR,view.WERKS) from hive_metastore.dqaas.temp_view as view join hive_metastore.dqaas.t_dqaas_marc as marc on marc.MATNR = view.MATNR where view.WERKS NOT IN ('BR91', 'BR92', 'BR94', 'BR97', 'BR98', 'BR9A', 'BR9B', 'BR9C', 'BR9D', 'BR9L', 'BR9X','CN9S', 'XM93', 'ZA90', 'ZA93') and marc.HERKL = view.HERKL and marc.LVORM = ' '")

// Build a Set of IDs per partition, then collect every partition's Set
// to the driver and flatten them into one Set.
val distinctValue: Set[String] = df.rdd.mapPartitions { data =>
  val uniqueIdSet = data.map(row => row.getAs[String](0)).toSet
  Iterator(uniqueIdSet)
}.collect().flatten.toSet

Rows in temp_view: 5,000

Rows in t_dqaas_marc: 22,354,457

The query output is 400,000+ records.

The exception I am getting:

Job aborted due to stage failure: Photon ran out of memory while executing this query.
Photon failed to reserve 512.0 MiB for hash table buckets, in SparseHashedRelation, in BuildHashedRelation, in BroadcastHashedRelation(spark_plan_id=59815).
Memory usage:
BroadcastHashedRelation(spark_plan_id=59815): allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
  BuildHashedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
    PartitionedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
      partition 0: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
        rows: allocated 890.0 MiB, tracked 890.0 MiB, untracked allocated 0.0 B, peak 890.0 MiB
        var-len data: allocated 420.0 MiB, tracked 420.0 MiB, untracked allocated 0.0 B, peak 420.0 MiB
    SparseHashedRelation: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
      hash table var-len key data: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
      hash table payloads: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
      hash table buckets: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
 
Caused by: SparkOutOfMemoryError: Photon ran out of memory while executing this query.
Photon failed to reserve 512.0 MiB for hash table buckets, in SparseHashedRelation, in BuildHashedRelation, in BroadcastHashedRelation(spark_plan_id=59815).
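For context, the trace shows Photon failing while it builds a BroadcastHashedRelation, i.e. the in-memory hash table for a broadcast join. As a hedged aside not raised in the thread: if the build side of the join is too large to broadcast, Spark's standard spark.sql.autoBroadcastJoinThreshold setting can be disabled to force a shuffle join instead, for example:

// Hedged sketch, not from the thread: disable automatic broadcast joins
// so Spark plans a shuffle join instead of building the in-memory
// broadcast hash relation that ran out of memory above.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")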

My cluster configuration:

2-16 Workers: 32-256 GB Memory, 8-64 Cores

1 Driver: 16 GB Memory, 4 Cores

Runtime: 11.3.x-scala2.12, Photon

Node type: Standard_D4as_v5

6-34 DBU/h

I have tried saving the output to another table, but I ran into the same issue:

df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(tempTableName)

No matter what operation I run on the DataFrame from the above SQL query, I always end up with an out-of-memory exception.

ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

That collect statement moves all the data to the driver, so you lose all parallelism and the driver has to do all the processing. If you beef up your driver, it might work.
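A minimal sketch building on that point (assuming df is the DataFrame from the original query; the unique_ids table name is hypothetical): the distinct work can stay on the cluster instead of collecting per-partition Sets to the driver.

// Deduplicate on the cluster; only one column is selected, so
// distinct() yields exactly the unique IDs.
val distinctDf = df.distinct()

// Keep the result distributed, e.g. write it to a table...
distinctDf.write.format("delta").mode("overwrite").saveAsTable("unique_ids")

// ...and collect to the driver only if a local Set is truly needed:
// val distinctValue: Set[String] =
//   distinctDf.rdd.map(_.getString(0)).collect().toSet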


REPLIES

Anonymous
Not applicable

Hi @Anil Kumar Chauhan,

We haven't heard from you since the last response from @Werner Stinckens. Kindly share the requested information with us, and we will provide you with the necessary solution.

Thanks and Regards
