Set default "spark.driver.maxResultSize" from the notebook

mhansinger
New Contributor II

Hello,

I would like to set the default "spark.driver.maxResultSize" from the notebook on my cluster. I know I can do that in the cluster settings, but is there a way to set it from code?

I also know how to set it when I start a Spark session, but in my case I load the data directly from the Feature Store and want to convert my PySpark DataFrame to pandas.

from databricks import feature_store
import pandas as pd
import pyspark.sql.functions as f
from os.path import join
 
fs = feature_store.FeatureStoreClient()
 
prediction_data = fs.read_table(name=NAME)  # NAME: placeholder for the feature table name
 
prediction_data_pd = prediction_data.toPandas()
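
For context, the "set it when I start a Spark session" route mentioned above would look roughly like this (a minimal sketch, assuming a fresh session is actually being created, e.g. in a standalone script; the app name is a placeholder):

from pyspark.sql import SparkSession

# Only takes effect if the session is created here; if a session already
# exists (as in a Databricks notebook), getOrCreate() returns that one instead.
spark = (SparkSession.builder
    .appName('feature-store-export')  # placeholder app name
    .config('spark.driver.maxResultSize', '4g')
    .getOrCreate())

print(spark.conf.get('spark.driver.maxResultSize', 'not set'))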

1 ACCEPTED SOLUTION

Accepted solution: see Atanu's reply below.

6 REPLIES

Kaniz
Community Manager

Hi @Maximilian Hansinger,

Please try this:

from pyspark import SparkContext
from pyspark import SparkConf

conf = (SparkConf()
          .setMaster('yarn')  # use the master appropriate for your cluster manager
          .setAppName('xyz')
          .set('spark.driver.extraClassPath', '/usr/local/bin/postgresql-42.2.5.jar')
          .set('spark.executor.instances', 4)
          .set('spark.executor.cores', 4)
          .set('spark.executor.memory', '10g')
          .set('spark.driver.memory', '15g')
          .set('spark.memory.offHeap.enabled', True)
          .set('spark.memory.offHeap.size', '20g')
          .set('spark.driver.maxResultSize', '4096'))

spark_context = SparkContext(conf=conf)

mhansinger
New Contributor II

Hi @Kaniz Fatma, thanks for your reply.

Not sure if that helps. When I check, after running your code, with

spark.conf.get("spark.driver.maxResultSize")

I still get the default value for "spark.driver.maxResultSize" instead of 4096.
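
A quick way to double-check what the running context was actually started with (a small sketch, assuming the notebook is already attached to a live SparkContext via the usual spark session object):

# Inspect the configuration of the already-running SparkContext; the key only
# shows up here if it was set before the driver started, e.g. via the cluster's Spark config.
for key, value in spark.sparkContext.getConf().getAll():
    if 'maxResultSize' in key:
        print(key, value)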

Kaniz
Community Manager

Hi @Maximilian Hansinger, alternatively try this:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master('yarn')  # depends on the cluster manager of your choice
    .appName('xyz')
    .config('spark.driver.extraClassPath', '/usr/local/bin/postgresql-42.2.5.jar')
    .config('spark.executor.instances', 4)
    .config('spark.executor.cores', 4)
    .config('spark.executor.memory', '10g')
    .config('spark.driver.memory', '15g')
    .config('spark.memory.offHeap.enabled', True)
    .config('spark.memory.offHeap.size', '20g')
    .config('spark.driver.maxResultSize', '4096')
    .getOrCreate()
)
sc = spark.sparkContext
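
Note that on a Databricks cluster a SparkSession usually already exists, so getOrCreate() returns that existing session and some of the builder options may not take effect. It is worth verifying the value afterwards, for example:

# Check whether the running session actually picked up the setting; if this
# still prints the default (1g), the option has to be set on the cluster
# itself before the driver starts.
print(spark.conf.get('spark.driver.maxResultSize', 'not set'))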

Atanu
Esteemed Contributor

@Maximilian Hansinger, maybe you can follow this:

https://kb.databricks.com/jobs/job-fails-maxresultsize-exception.html
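
In short (as I read that article), spark.driver.maxResultSize has to be configured on the cluster before the driver starts, i.e. in the cluster's Spark config (cluster settings > Advanced options > Spark), for example:

spark.driver.maxResultSize 4g

After the cluster restarts, the notebook should pick up the new value.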

Anonymous
Not applicable

@Maximilian Hansinger - Would you let us know how it goes, please?

Anonymous
Not applicable

Hi @Maximilian Hansinger,

Just wanted to check in to see whether you were able to resolve your issue. If yes, would you be happy to mark the answer as best? If not, please tell us so we can help you.

Thanks!
