Topics with Label: Best Practices

by Anonymous • Not applicable

06-08-2021 7:26:50 PM

6062 Views
1 reply
0 kudos

Resolved! Ideal number and size of partitions

By default, Spark uses 200 partitions when a transformation shuffles data. 200 partitions might be too many if a user is working with small data, which can slow down the query. Conversely, 200 partitions might be too few if the data is big. So ho...

Data Engineering

Latest Reply

sajith_appukutt
Honored Contributor II

06-09-2021 3:35:00 AM

0 kudos

You could tweak the default value of 200 by changing the spark.sql.shuffle.partitions configuration to match your data volume. Here is sample Python code for calculating the value. However, if you have multiple workloads with different data volumes, instead ...
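A minimal sketch of how such a value might be calculated, assuming a target of roughly 128 MB per shuffle partition. That target is an assumed rule of thumb, not a figure from this thread, and the function name and parameters are hypothetical:

```python
import math

MB = 1024 * 1024


def estimate_shuffle_partitions(shuffle_bytes,
                                target_partition_bytes=128 * MB,
                                min_partitions=1):
    """Estimate a shuffle partition count from the data volume.

    Hypothetical helper: the 128 MB default target is an assumption,
    not a value taken from the original reply.
    """
    if shuffle_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target positive")
    # Round up so that no partition exceeds the target size on average.
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))


# Example: about 50 GiB of data flowing through the shuffle stage.
num_partitions = estimate_shuffle_partitions(50 * 1024 * MB)
print(num_partitions)
```

In a Spark session, the result could then be applied with spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions)) before running the query.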

by Anonymous • Not applicable

06-04-2021 12:48:09 PM

743 Views
2 replies
0 kudos

Resolved! Best practices to query logs

We dump our logs in S3 currently. Can you give us best practices to make these logs easier to query?

Data Engineering

Latest Reply

sajith_appukutt
Honored Contributor II

06-07-2021 7:14:00 AM

0 kudos

And if these are generic logs that land on S3, it'd be worth taking a look at Auto Loader. Here is a blog post on processing CrowdStrike logs in a similar way.
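As a rough sketch of what an Auto Loader ingest of such logs might look like on Databricks: it assumes the logs are JSON files, that a SparkSession is passed in as spark, and that every path is a placeholder. The helper names are hypothetical; only the cloudFiles source and its options come from Auto Loader itself.

```python
def autoloader_options(file_format, schema_location):
    """Build the Auto Loader reader options (hypothetical helper)."""
    return {
        # Format of the raw files landing in the bucket (assumed JSON here).
        "cloudFiles.format": file_format,
        # Where Auto Loader keeps the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def start_log_ingest(spark, source_path, target_path, checkpoint_path,
                     file_format="json"):
    """Incrementally load new log files from source_path into a Delta table.

    Must run on a Databricks cluster, where the "cloudFiles" source exists.
    """
    reader = spark.readStream.format("cloudFiles")
    options = autoloader_options(file_format, checkpoint_path + "/schema")
    for key, value in options.items():
        reader = reader.option(key, value)
    logs = reader.load(source_path)

    # The checkpoint records which files were already processed.
    return (logs.writeStream
            .format("delta")
            .option("checkpointLocation", checkpoint_path)
            .start(target_path))


# Placeholder locations, for illustration only.
example_options = autoloader_options("json", "s3://example-bucket/_checkpoints/logs/schema")
print(example_options)
```

Once the logs are in a Delta table, they can be queried with ordinary SQL instead of scanning the raw files in S3 each time.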

1 More Replies
