Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

guruv
by New Contributor III
  • 5014 Views
  • 5 replies
  • 1 kudos

Resolved! Spark UI not showing any running tasks

Hi, I am running a notebook job calling JAR code (application code implemented in C#). In the Spark UI page, for almost 2 hrs it's not showing any tasks, and even the CPU usage is below 20%; memory usage is very small. Before this 2 hr window it shows...

Latest Reply
Kaniz_Fatma
Community Manager

Hi @guruv, does @Atanu Sarkar's response answer your query?

  • 1 kudos
4 More Replies
sajith_appukutt
by Honored Contributor II
  • 1253 Views
  • 1 replies
  • 0 kudos

Resolved! I'm using the Redshift data source to load data into Spark SQL DataFrames. However, I'm not seeing predicate pushdown for queries run against Redshift - is that expected?

I was expecting filter operations to be pushed down to Redshift by the optimizer. However, the entire dataset is getting loaded from Redshift.

Latest Reply
sajith_appukutt
Honored Contributor II

The Spark driver for Redshift pushes the following operators down into Redshift: Filter, Project, Sort, Limit, Aggregation, and Join. However, it does not support expressions operating on dates and timestamps today. If you have a similar requirement, please add a fea...

  • 0 kudos
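
To make the distinction concrete, here is a minimal sketch (my own illustration, not from the thread; the connection URL, table, columns, and tempdir are placeholders): a plain comparison filter that the connector can push down, next to a date-expression filter that Spark evaluates itself.

from pyspark.sql import functions as F

# Assumes a Databricks notebook where `spark` is already defined.
df = (
    spark.read
    .format("com.databricks.spark.redshift")   # the Redshift data source
    .option("url", "jdbc:redshift://host:5439/db?user=...&password=...")
    .option("dbtable", "public.events")
    .option("tempdir", "s3a://my-bucket/redshift-tmp/")
    .option("forward_spark_s3_credentials", "true")
    .load()
)

# A plain comparison filter is eligible for pushdown into the generated Redshift query.
pushed = df.filter(F.col("status") == "active").select("id", "status")

# A filter built on a date/timestamp expression is not pushed down today,
# so Spark loads the rows and applies the filter itself.
not_pushed = df.filter(F.year(F.col("created_at")) == 2021)

pushed.explain()       # inspect the plans to confirm which filters were pushed down
not_pushed.explain()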
sajith_appukutt
by Honored Contributor II
  • 1829 Views
  • 1 replies
  • 0 kudos

Resolved! Re-optimize in Delta not splitting large files into smaller files.

I am trying to re-optimize a Delta table with a max file size of 32 MB. But after changing spark.databricks.delta.optimize.maxFileSize and trying to optimize a partition, it doesn't split larger files into smaller ones. How can I get it to work?

Latest Reply
sajith_appukutt
Honored Contributor II

spark.databricks.delta.optimize.maxFileSize controls the target size to bin-pack files when you run the OPTIMIZE command, but it will not split larger files into smaller ones today. File splitting happens when ZORDER is run, however.

  • 0 kudos
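
A minimal sketch of the difference (my own illustration; the table name, partition filter, and Z-ORDER column are placeholders):

# Target size used when OPTIMIZE bin-packs small files (32 MB here).
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 32 * 1024 * 1024)

# Plain OPTIMIZE compacts small files up to the target size but does not split large files.
spark.sql("OPTIMIZE my_db.events WHERE date = '2021-06-01'")

# OPTIMIZE ... ZORDER BY rewrites the affected files, which does split large ones.
spark.sql("OPTIMIZE my_db.events WHERE date = '2021-06-01' ZORDER BY (user_id)")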
Srikanth_Gupta_
by Valued Contributor
  • 1114 Views
  • 2 replies
  • 1 kudos

What are best practices for Spark streaming in Databricks?

What are best practices for Spark streaming in Databricks? Is it a good idea to consume multiple topics in one streaming job? Is autoscaling recommended for Spark streaming? How many worker nodes should we choose for a streaming job? When should we run OPTIMIZE...

Latest Reply
craig_ng
New Contributor III

See our docs for other considerations when deploying a production streaming job.

  • 1 kudos
1 More Reply
User16826992666
by Valued Contributor
  • 1120 Views
  • 1 replies
  • 0 kudos

Resolved! I know my partitions are skewed; is there anything I can do to help my performance?

I know the skew in my dataset has the potential to cause issues with my job performance, so just wondering if there is anything I can do to help my performance other than repartitioning the whole dataset.

Latest Reply
sajith_appukutt
Honored Contributor II

For scenarios like this, it is recommended to use a cluster with Databricks Runtime 7.3 LTS or above, where AQE is enabled. AQE dynamically handles skew in sort-merge join and shuffle-hash join by splitting (and replicating if needed) skewed tasks into ...

  • 0 kudos
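
A minimal sketch of what that looks like in practice (my own illustration; table and column names are placeholders, and on DBR 7.3 LTS+ these settings are already on by default):

# Enable adaptive query execution and its skew-join handling explicitly.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

orders = spark.table("my_db.orders")        # assume this side is skewed on customer_id
customers = spark.table("my_db.customers")

# With AQE enabled, skewed shuffle partitions in this sort-merge join are split
# (and the other side replicated as needed) at runtime, without repartitioning by hand.
joined = orders.join(customers, "customer_id")
joined.explain()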
User16826992666
by Valued Contributor
  • 1577 Views
  • 1 replies
  • 0 kudos

Resolved! What should I be looking for when evaluating the performance of a Spark job?

Where do I start when performance tuning my queries? Are there particular things I should be looking out for?

Latest Reply
Srikanth_Gupta_
Valued Contributor

A few things off the top of my head: 1) Check the Spark UI and see which stage is taking the most time. 2) Check for data skew. 3) Data skew can severely degrade query performance; Spark SQL accepts skew hints in queries. Also make sure to use proper join h...

  • 0 kudos
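
As an example of the join-hint point, here is a minimal sketch (my own illustration; table and column names are placeholders) that broadcasts a small dimension table and inspects the plan:

from pyspark.sql.functions import broadcast

facts = spark.table("my_db.sales")      # large fact table
dims = spark.table("my_db.products")    # small dimension table

# Hint Spark to broadcast the small side instead of shuffling both sides of the join.
joined = facts.join(broadcast(dims), "product_id")

# Check the physical plan (BroadcastHashJoin vs. SortMergeJoin) before running,
# and compare stage timings in the Spark UI afterwards.
joined.explain()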
Srikanth_Gupta_
by Valued Contributor
  • 978 Views
  • 1 replies
  • 0 kudos

How does the Spark SQL Catalyst optimizer work?

How does the Catalyst optimizer improve performance, and what is its role?

Latest Reply
Srikanth_Gupta_
Valued Contributor

The Catalyst optimizer converts an unresolved logical plan into an executable physical plan; a deep dive is available here.

  • 0 kudos
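
To see Catalyst at work, an extended explain prints every plan it produces; a minimal sketch (my own illustration; the table name is a placeholder):

# Prints the parsed (unresolved) logical plan, analyzed plan, optimized logical plan,
# and the physical plan that Catalyst selects for execution.
df = (
    spark.table("my_db.events")
    .filter("status = 'active'")
    .groupBy("country")
    .count()
)
df.explain(True)   # or df.explain(mode="extended") on Spark 3.x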
User16790091296
by Contributor II
  • 746 Views
  • 0 replies
  • 5 kudos

Some Tips & Tricks for Optimizing costs and performance (Clusters and Ganglia): [Note: This list is not exhaustive] Leverage the DataFrame or Spar...

Some Tips & Tricks for Optimizing costs and performance (Clusters and Ganglia): [Note: This list is not exhaustive] Leverage the DataFrame or Spark SQL APIs first. They use the same execution process, resulting in parity in performance, but they also com...
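
On the DataFrame/Spark SQL parity point, a minimal sketch (my own illustration; the table name is a placeholder) showing that both APIs compile to essentially the same plan:

# Same query expressed through the DataFrame API and through Spark SQL.
df_api = (
    spark.table("my_db.events")
    .where("country = 'US'")
    .groupBy("status")
    .count()
)
sql_api = spark.sql("""
    SELECT status, COUNT(*) AS count
    FROM my_db.events
    WHERE country = 'US'
    GROUP BY status
""")

# Both explain() outputs show essentially the same optimized physical plan.
df_api.explain()
sql_api.explain()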

SatheesshChinnu
by New Contributor III
  • 10151 Views
  • 4 replies
  • 0 kudos

Resolved! Error: TransportResponseHandler: Still have 1 requests outstanding when connection, occurring only on large datasets.

I am getting the below error only with a large dataset (i.e. 15 TB compressed). If my dataset is small (1 TB) I am not getting this error. It looks like it fails at the shuffle stage. Approx. number of mappers is 150,000. Spark config: spark.sql.warehouse.dir hdfs:...

Latest Reply
parikshitbhoyar
New Contributor II

@Satheessh Chinnusamy how did you solve the above issue?

  • 0 kudos
3 More Replies
WajdiFATHALLAH
by New Contributor
  • 13147 Views
  • 4 replies
  • 0 kudos

Writing a large Parquet file (500 million rows / 1000 columns) to S3 takes too much time

Hello community, first let me introduce my use case. I receive 500 million rows daily, like so: ID | Categories; 1 | cat1, cat2, cat3, ..., catn; 2 | cat1, catx, caty, ..., anothercategory. Input data: 50 compressed CSV files, each file is 250 MB ...

Latest Reply
EliasHaydar
New Contributor II

So you are basically creating an inverted index?

  • 0 kudos
3 More Replies
rlgarris
by Contributor
  • 7054 Views
  • 5 replies
  • 0 kudos

Resolved! How do I get a cartesian product of a huge dataset?

A Cartesian product is a common operation to get the cross product of two tables. For example, say you have a list of customers and a list of your product catalog and want to get the cross product of all customer-product combinations. Cartesian pr...

Latest Reply
Forum_Admin
Contributor

Hi buddies, this is a great, clearly written piece; keep up the good work.

  • 0 kudos
4 More Replies
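
For reference, a minimal sketch of an explicit cross join in Spark (my own illustration; table names are placeholders, and Cartesian products should be used deliberately because the output grows multiplicatively):

# Allow Cartesian products explicitly (this is already the default behaviour on Spark 3.x).
spark.conf.set("spark.sql.crossJoin.enabled", "true")

customers = spark.table("my_db.customers").select("customer_id")
products = spark.table("my_db.products").select("product_id")

# Every customer paired with every product: |customers| * |products| rows.
combos = customers.crossJoin(products)
print(combos.count())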