Databricks Community

Gopal269673 · ‎03-31-2023

Hi All.. Need your help in this issue what i am facing. Currently we are using data bricks as a platform to build pipeline and execute our talend ETL sqls converted into the spark sql framework as we were facing issues in loading the history data into the platform of customer.

During this process we see good results for dimensional loads but coming to few facts we are facing lot of issues in execution. Even after changing the node configuration to scalable upto 8 worker nodes using i3xlarge 30.5gb 4 cores each node to execute it.

However, I see that few queries are running more than 4 hours and the current query is not at all running from 6 hours even though the full nodes getting launched.

The row count in the table i see as 4337765617. I am attaching the query and also few snapshot of logs for your quick reference and guidance here to come out of this problem. Appreciate your quick help in this regard.

Regards,

Gopal

Gopal269673 · ‎04-06-2023

@All Users Group Hi All.. we had tried several options of tuning the query by selecting required variables in the select and subsequent clauses. I see other queries are little ok to run. But the attached query seems not able to run from past 6 hours with 8 worker nodes config. I see that spill is high and attached the metrics for it. Anyone can suggest optimization techniques in python note book for looking into it as I am getting only scala related programs. Please help in optimization best methods guide and material more specific to Pyspark & Sql.

View solution in original post

Ajay-Pandey · ‎04-01-2023

@Gopal B ,

Have you tried by increasing the worker and memory ?

Ajay Kumar Pandey

Gopal269673 · ‎04-01-2023

@Ajay Pandey Yes initially it was running in 4 workers ..now after scaling upto 8 workers also from past 6 hours i dont see any movement and it was telling filtering as per screen shot. Not sure to tune nothing is there in the query other than insert and coalesce function for few variables. Need quick help from tuning it and as well debugging the issue for optimization parameters

Gopal269673 · ‎04-01-2023

@All Users Group Please see whether anyone can help here... As i see after photon acceleration enablement also didnt see much success here. However same query in AWS arora took 8 hours to complete with 100% data ..but here it is unable to execute with 10% of data..please suggest and help

pvignesh92 · ‎04-03-2023

@Gopal B Hi. I see that there is a row number and a select on another table to populate a column. Did you execute that part alone and see how much time that takes on your 10% data?

Also I see that you are using row_number function to just generate a incremental id, is it possible you can achieve the same with GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [ INCREMENT BY step ] ) ] ] ? These might avoid a full table scan to populate your row_number.

Gopal269673 · ‎04-03-2023

@Vigneshraja Palaniraj Thanks for the review of my query and providing your view for it. Individual query will run and see how much time it takes. Meanwhile alternative for row_number to create identity. We can't be able to create identity here. Any other suggestion please help here. Do you think any other optimization will help? I tried suing photon also but not really useful and execution stuck still so terminated it.

pvignesh92 · ‎04-03-2023

Hi @Gopal B Photon can't be effective in all the cases. It will be of good effect when you have good amount of joins and aggregations happening in your logic. But in your case, it seems to be a simple read and write.

Also see if you can reduce the number of workers and have one or two large cluster with more cores and memory like 4XL, so that you can avoid shuffles. But Please see at which point, you are taking time as I see from your job log, only one task is running.

Gopal269673 · ‎04-03-2023

@Vigneshraja Palaniraj yes i agree with you photon is not useful here. As per your suggestion will try to see what can be done by enabling memory like cluster nodes for this part. As of now i am checking query tuning and observing any data elements can be reduced here in selection based on business approval. However, it is lengthy process and want to avoid it. In the job log only one task is running and that is why i am surprised. The no of nodes didnt even helped in this part of code

pvignesh92 · ‎04-03-2023

Yes. Because the row number with order by functionality have to bring all the records to a single worker so that it can assign a running id to your data sequentially. So only after this operation is completed, the write can happen in parallel depending on your partitions. You can also consider ordering your input 'installment' table while writing itself on the column that you are ordering and see if it can bring down your timings.

Gopal269673 · ‎04-03-2023

@Vigneshraja Palaniraj Sure Vignesh . Thanks a for your review and inputs. let me try this. Currently shared cluster is used by users and need to plan in sometime where jobs are not planned to align it to single node memory one. Any suggestable memory instance will suite here based on your experience?

Coming to installment table it is around 32Gb and i am wondering that ordering will take how long. I will add it my list of tasks to perform and see as option.

Gopal269673 · ‎04-06-2023

@All Users Group Hi All.. we had tried several options of tuning the query by selecting required variables in the select and subsequent clauses. I see other queries are little ok to run. But the attached query seems not able to run from past 6 hours with 8 worker nodes config. I see that spill is high and attached the metrics for it. Anyone can suggest optimization techniques in python note book for looking into it as I am getting only scala related programs. Please help in optimization best methods guide and material more specific to Pyspark & Sql.