Data Engineering

Forum Posts

Anonymous
by Not applicable
  • 6062 Views
  • 1 reply
  • 0 kudos

Resolved! Ideal number and size of partitions

By default, Spark uses 200 partitions for transformations that shuffle data. 200 partitions may be too many when a user is working with small data, where the per-partition overhead can slow the query down. Conversely, 200 partitions may be too few when the data is big. So ho...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You could change the default value of 200 by setting the spark.sql.shuffle.partitions configuration to match your data volume. Here is sample Python code for calculating the value. However, if you have multiple workloads with different data volumes, instead ...
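The sample code from the reply did not survive in this listing, so the following is only a rough sketch of how such a value might be estimated, not the original snippet. It assumes a target of about 128 MiB per shuffle partition (a common rule of thumb, not something stated in the post) and rounds the result up to a multiple of the cluster's core count. The function name and its parameters are hypothetical.

```python
import math

# Assumed target size per shuffle partition (a rule of thumb, not from the post).
DEFAULT_TARGET_BYTES = 128 * 1024**2


def suggest_shuffle_partitions(shuffle_bytes, total_cores=1,
                               target_partition_bytes=DEFAULT_TARGET_BYTES):
    """Estimate a value for spark.sql.shuffle.partitions.

    shuffle_bytes: approximate amount of data being shuffled, in bytes.
    total_cores: total executor cores available to the job.
    """
    if total_cores < 1 or target_partition_bytes < 1:
        raise ValueError("total_cores and target_partition_bytes must be >= 1")
    if shuffle_bytes <= 0:
        return total_cores
    # Enough partitions that each one holds roughly the target size.
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    # Round up to a multiple of the core count so every core gets work
    # in each wave of tasks.
    return math.ceil(by_size / total_cores) * total_cores


if __name__ == "__main__":
    # 50 GiB of shuffle data on a 16-core cluster.
    print(suggest_shuffle_partitions(50 * 1024**3, total_cores=16))
```

The result could then be applied in a Spark session with `spark.conf.set("spark.sql.shuffle.partitions", n)`, where `n` is the returned value.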

Anonymous
by Not applicable
  • 743 Views
  • 2 replies
  • 0 kudos

Resolved! Best practices to query logs

We dump our logs in S3 currently. Can you give us best practices to make these logs easier to query?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

And if these are generic logs that land on S3, it would be worth taking a look at Auto Loader. Here is a blog post on processing CrowdStrike logs in a similar way.
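As a rough illustration of the Auto Loader suggestion, the sketch below builds the options for a `cloudFiles` streaming read and passes them to a Spark session. It assumes a Databricks runtime, JSON-formatted logs, and placeholder S3 paths; the helper names and the bucket paths are hypothetical and are not taken from the thread or the linked blog post.

```python
def autoloader_options(file_format, schema_location):
    """Build the option map for an Auto Loader (cloudFiles) read."""
    return {
        # Format of the incoming log files, e.g. "json" or "csv".
        "cloudFiles.format": file_format,
        # Where Auto Loader keeps the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def read_logs_stream(spark, source_path, file_format="json",
                     schema_location="s3://example-bucket/_schemas/logs"):
    """Return a streaming DataFrame that picks up new log files as they arrive.

    `spark` is an existing SparkSession on a Databricks cluster; the paths
    are placeholders to be replaced with real locations.
    """
    options = autoloader_options(file_format, schema_location)
    return (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
```

On a Databricks cluster this could be called as `read_logs_stream(spark, "s3://example-bucket/logs/")`, and the resulting stream written to a Delta table so the logs can be queried with SQL.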
