by guruv • New Contributor III
- 3464 Views
- 5 replies
- 1 kudos
Hi, I am running a Notebook job calling JAR code (application code implemented in C#). For almost 2 hrs the Spark UI page is not showing any tasks, and even the CPU usage is below 20%; memory usage is very small. Before this 2 hr window it shows...
Latest Reply
Hi @guruv, does @Atanu Sarkar's response answer your query?
4 More Replies
- 766 Views
- 1 reply
- 0 kudos
I was expecting filter operations to be pushed down to Redshift by the optimizer. However, the entire dataset is getting loaded from Redshift.
Latest Reply
The Spark driver for Redshift pushes the following operators down into Redshift: Filter, Project, Sort, Limit, Aggregation, and Join. However, it does not support expressions operating on dates and timestamps today. If you have a similar requirement, please add a fea...
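The behavior described in the reply can be sketched roughly in PySpark. The connection values below are placeholders, the table and column names (`sales`, `region`, `sold_at`) are made up, and the exact options depend on your Redshift connector version:

```python
# Hypothetical connection details; verify option names against your
# spark-redshift connector version.
df = (
    spark.read.format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://<host>:5439/mydb?user=...&password=...")
    .option("dbtable", "sales")
    .option("tempdir", "s3a://my-bucket/tmp/")
    .option("forward_spark_s3_credentials", "true")
    .load()
)

# A plain column comparison like this is eligible for Filter pushdown,
# so Redshift only returns the matching rows:
pushed = df.filter(df.region == "EMEA")

# An expression over a date/timestamp column may NOT be pushed down today,
# so the full table is scanned and filtered on the Spark side:
not_pushed = df.filter(df.sold_at.cast("date") == "2022-01-01")
```

Checking `df.explain()` on the resulting DataFrame shows whether the filter appears inside the generated Redshift query or as a Spark-side `Filter` node.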
- 1294 Views
- 1 reply
- 0 kudos
I am trying to re-optimize a Delta table with a max file size of 32 MB. But after changing spark.databricks.delta.optimize.maxFileSize and trying to optimize a partition, it doesn't split larger files into smaller ones. How can I get it to work?
Latest Reply
spark.databricks.delta.optimize.maxFileSize controls the target size for bin-packing files when you run the OPTIMIZE command, but it will not split larger files into smaller ones today. File splitting does happen when ZORDER is run, however.
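A minimal sketch of the two behaviors the reply distinguishes; the table name `events` and column `eventType` are hypothetical, and this needs a Databricks cluster with Delta:

```python
# Target ~32 MB files when bin-packing during OPTIMIZE.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 32 * 1024 * 1024)

# Bin-packing alone coalesces small files toward the target size,
# but does not split files that are already larger than it:
spark.sql("OPTIMIZE events WHERE date = '2022-01-01'")

# ZORDER rewrites the data files it touches, so oversized files
# get split as part of the rewrite:
spark.sql("OPTIMIZE events WHERE date = '2022-01-01' ZORDER BY (eventType)")
```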
- 658 Views
- 2 replies
- 1 kudos
What are best practices for Spark streaming in Databricks?
- Is it a good idea to consume multiple topics in one streaming job?
- Is autoscaling recommended for Spark streaming?
- How many worker nodes should we choose for a streaming job?
- When should we run OPTIMIZE...
Latest Reply
See our docs for other considerations when deploying a production streaming job.
1 More Replies
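One common pattern touched on above, one topic per streaming job, can be sketched as follows; the broker address, topic name, and paths are placeholders, and this is a rough Structured Streaming outline rather than a production recipe:

```python
# One Kafka topic per streaming job keeps failure domains and tuning
# (trigger interval, cluster size) independent per topic.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                      # hypothetical topic
    .load()
)

query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder path
    .trigger(processingTime="1 minute")
    .start("/delta/events")                                   # placeholder path
)
```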
- 664 Views
- 1 reply
- 0 kudos
I know the skew in my dataset has the potential to cause issues with my job performance, so just wondering if there is anything I can do to help my performance other than repartitioning the whole dataset.
Latest Reply
For scenarios like this, it is recommended to use a cluster with Databricks Runtime 7.3 LTS or above, where AQE is enabled. AQE dynamically handles skew in sort-merge join and shuffle hash join by splitting (and replicating if needed) skewed tasks into ...
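The AQE skew handling mentioned in the reply is driven by a handful of Spark SQL settings. A configuration sketch (these are the standard Spark 3.x config keys; the values shown are illustrative, not tuned recommendations):

```python
# AQE and its skew-join handling are on by default in recent runtimes,
# but can be set explicitly:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# A partition is treated as skewed when it exceeds BOTH this factor times
# the median partition size AND the byte threshold below; skewed
# partitions are then split into smaller tasks:
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
```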
- 1117 Views
- 1 reply
- 0 kudos
Where do I start when starting performance tuning of my queries? Are there particular things I should be looking out for?
Latest Reply
A few things off the top of my head:
1) Check the Spark UI and see which stage is taking more time.
2) Check for data skew.
3) Data skew can severely degrade query performance; Spark SQL accepts skew hints in queries. Also make sure to use proper join h...
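The skew and join hints mentioned in point 3 look roughly like this; the tables `orders` and `customers` and their columns are hypothetical. The `SKEW` hint is Databricks-specific, while `BROADCAST` is a standard Spark join strategy hint:

```python
# Databricks skew hint: tells the optimizer the 'orders' relation is skewed.
spark.sql("""
  SELECT /*+ SKEW('orders') */ o.id, c.name
  FROM orders o JOIN customers c ON o.cust_id = c.id
""")

# Standard Spark hint: broadcast the smaller side to avoid a shuffle join.
spark.sql("""
  SELECT /*+ BROADCAST(customers) */ o.id, c.name
  FROM orders o JOIN customers c ON o.cust_id = c.id
""")
```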
- 616 Views
- 1 reply
- 0 kudos
How does the Catalyst optimizer improve performance? What is its role?
Latest Reply
The Catalyst optimizer converts an unresolved logical plan into an executable physical plan; a deep dive is available here
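You can watch Catalyst walk through those plan stages on any DataFrame; a quick sketch (needs a live SparkSession):

```python
df = spark.range(10).filter("id > 5").select("id")

# explain(True) prints, in order: the parsed (unresolved) logical plan,
# the analyzed logical plan, the optimized logical plan, and the final
# physical plan -- i.e. each stage Catalyst produces on the way to execution.
df.explain(True)
```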
- 455 Views
- 0 replies
- 5 kudos
Some tips & tricks for optimizing costs and performance (clusters and Ganglia) [Note: this list is not exhaustive]: Leverage the DataFrame or Spark SQL APIs first. They use the same execution process, resulting in parity in performance, but they also com...
- 8147 Views
- 4 replies
- 0 kudos
I am getting the below error only with a large dataset (i.e. 15 TB compressed). If my dataset is small (1 TB) I am not getting this error.
It looks like it fails at the shuffle stage. The approximate number of mappers is 150,000.
Spark config: spark.sql.warehouse.dir hdfs:...
Latest Reply
@Satheessh Chinnusamy, how did you solve the above issue?
3 More Replies
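The thread does not record the actual fix, but a common mitigation for shuffle failures at this scale is to raise the shuffle partition count (or let AQE coalesce and split partitions adaptively) so individual shuffle blocks stay small. A configuration sketch, not a confirmed solution for this case; the value shown is illustrative:

```python
# Default is 200 shuffle partitions; with ~150,000 map tasks and a 15 TB
# input, that produces very large reduce-side blocks. Raising the count
# shrinks each shuffle block:
spark.conf.set("spark.sql.shuffle.partitions", "4000")

# Alternatively/additionally, AQE adjusts shuffle partition sizes at runtime:
spark.conf.set("spark.sql.adaptive.enabled", "true")
```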
- 8890 Views
- 4 replies
- 0 kudos
Hello community. First let me introduce my use case: I receive about 500 million rows daily, like so:
ID | Categories
1 | cat1, cat2, cat3, ..., catn
2 | cat1, catx, caty, ..., anothercategory
Input data: 50 compressed CSV files, each file 250 MB ...
Latest Reply
So you are basically creating an inverted index?
3 More Replies
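The inverted index the reply alludes to (category → list of IDs) can be illustrated in plain Python on sample rows shaped like the question's data; the parsing is a sketch, and the real input would be CSV at far larger scale, handled in Spark:

```python
from collections import defaultdict

# Sample rows shaped like the "ID | Categories" data in the question.
rows = [
    (1, "cat1, cat2, cat3"),
    (2, "cat1, catx, caty"),
]

# Invert "ID -> categories" into "category -> IDs".
index = defaultdict(list)
for row_id, categories in rows:
    for cat in categories.split(","):
        index[cat.strip()].append(row_id)

print(dict(index))
```

In Spark the same shape falls out of `split` + `explode` on the Categories column followed by a group-by on category.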
- 5351 Views
- 5 replies
- 0 kudos
A cartesian product is a common operation to get the cross product of two tables.
For example, say you have a list of customers and a list of your product catalog and want to get the cross product of all customer - product combinations.
Cartesian pr...
Latest Reply
Hi buddies, this is a great, well-written piece; keep up the good work.
4 More Replies
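The customer × product cross product described above can be illustrated in plain Python (in Spark the equivalent operation is `DataFrame.crossJoin`); the customer and catalog values are made up:

```python
from itertools import product

customers = ["alice", "bob"]
catalog = ["widget", "gadget", "gizmo"]

# Every customer paired with every product:
# the result has |customers| * |catalog| rows.
pairs = list(product(customers, catalog))
print(len(pairs))  # 2 * 3 = 6
```

This also shows why Cartesian products get expensive fast: the row count is the product of the input sizes, so even modest tables can blow up.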