Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Ankith
by New Contributor
  • 1555 Views
  • 2 replies
  • 1 kudos

Resolved! How to enable spark.shuffle.compress in Spark 3.3.0 or later?

When I try to set that, I get the following error. I'd appreciate your comments; thank you in advance.

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Ankith Patlolla Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...

1 More Replies
alejandrofm
by Valued Contributor
  • 1834 Views
  • 2 replies
  • 2 kudos

Resolved! Lots of shuffle write on OPTIMIZE + ZORDER, is it normal?

Hi! I'm optimizing several TB of partitioned data with ZSTD at level 9. The amount of shuffle write surprises me. It could make sense because of ZORDER, but I want to be sure that I'm not missing something. Here is some context: Could I be missing something...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Alejandro Martinez Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best an...

1 More Replies
user_b22ce5eeAl
by New Contributor II
  • 915 Views
  • 2 replies
  • 0 kudos

Grouped map pandas UDF fails

Hello, I am trying to get the SHAP values for my whole dataset using a pandas UDF for each category of a categorical variable. It runs well on a few categories, but when I run the function on the whole dataset my job fails. I see ...

Latest Reply
Jackson
New Contributor II
  • 0 kudos

I want to use data.groupby.apply() to apply a function to each row of my PySpark DataFrame per group. I used the Grouped Map Pandas UDFs. However, I can't figure out how to add another argument to my function. I tried using the ar...

1 More Replies
Anonymous
by Not applicable
  • 10391 Views
  • 1 reply
  • 0 kudos

Resolved! Tuning shuffle partitions

Is the best practice for tuning shuffle partitions to have the config "autoOptimizeShuffle.enabled" on? I see it is not switched on by default. Why is that?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

AQE (enabled by default from 7.3 LTS onwards) adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. So as data size grows or shrinks over different stages, the task size wi...
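To make the idea in the reply above concrete, here is a minimal pure-Python sketch of coalescing adjacent shuffle partitions toward a target size. It is a simplified model for illustration only, not Spark's actual implementation: the function name `coalesce_partitions`, the sample sizes, and the 64 MB target are all assumptions. In open-source Spark the behaviour itself is governed by settings such as `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled`.

```python
from typing import List

MB = 1024 * 1024


def coalesce_partitions(sizes_bytes: List[int], target_bytes: int) -> List[List[int]]:
    """Group adjacent shuffle partitions so each group stays at or under target_bytes.

    Hypothetical helper: a simplified model of size-based coalescing,
    not the algorithm Spark uses internally.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes_bytes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Illustrative (made-up) map-side output sizes for one stage.
stage_output = [10 * MB, 20 * MB, 5 * MB, 60 * MB, 2 * MB, 3 * MB]
groups = coalesce_partitions(stage_output, target_bytes=64 * MB)
print(groups)
```

Six small partitions collapse into fewer, more evenly sized tasks here. A stage with a different output size would yield a different grouping, which is the per-stage adjustment the reply describes.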
