Databricks

alejandrofm · ‎03-04-2023

Hi, I have a Pyspark job that takes about an hour to complete, when looking at the SQL tab on Spark UI I see this:

Those processes run for more than 1 minute on a 60-minute process.

This is Ganglia for that period (the last snapshot, will look into a live run for the last part)I enter via spark UI to task 18 on SQL and this is what I see:

And the details, fields, and database names were replaced by placeholders or "..." for compliance purposes

== Physical Plan ==
AppendDataExecV1 (1)
 
 
(1) AppendDataExecV1
Arguments: [num_affected_rows#1348L, num_inserted_rows#1349L], DeltaTableV2(org.apache.spark.sql.SparkSession@7ecdf898,dbfs:/mnt/eterlake/...../...,Some(CatalogTable(
Database: database
Table: table
Owner: (Basic token.....
Created Time: Sat Jul 13 16:06:20 UTC 2019
Last Access: UNKNOWN
Created By: Spark 2.4.0
Type: EXTERNAL
Provider: DELTA
Table Properties: [delta.lastCommitTimestamp=1662525805000, delta.lastUpdateVersion=8134, delta.minReaderVersion=1, delta.minWriterVersion=2]
Statistics: 0 bytes, 6260684735 rows
Location: dbfs:/mnt/.../location/...
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Schema: root
......
.....
.....
)),Some(spark_catalog.......),None,Map(),org.apache.spark.sql.util.CaseInsensitiveStringMap@1f), Project [... 26 more fields], org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$8007/1446072698@7a714f29, com.databricks.sql.transaction.tahoe.catalog.WriteIntoDeltaBuilder$$anon$1@1df0da7e

Do you see something that could be improved here?

Thanks!!!

daniel_sahal · ‎03-05-2023

@Alejandro Martinez

I would recommend you to go through this video:

https://www.youtube.com/watch?v=daXEp4HmS-E

Especially look through partitions, data skew, spills.

Also IMO the utilization (avg load) should be around 70%. Try to optimize your workload a little bit.

alejandrofm · ‎03-06-2023

Will look into that! thanks, really is a very simple process, the regex seems to be what is taking more time, that and the AppendDataExecV1. This is the other task that takes 38 minutes. The logic of the regex is this

dataframe = self.spark.read \

.text(source_files_path) \

.withColumn('source_file', source_file_derivation)

where source_file_derivation is:

source_file_derivation = regexp_replace(reverse(split(reverse(input_file_name()), '/')[0]), '%23', '#')

To add the filename on a column of the data frame (we read multiple files).

Thanks!

Debayan · ‎03-07-2023

Hi,

When you say it is taking a lot of time, was there a situation where this was running earlier than the time taking now?

Also, approximately how much amount of data is getting processes with this JOB?

Is it consistently taking this much time?

Could you also please confirm about the cluster configuration (also the DBR version?) this is running on?

Also please tag @Debayan with your next response which will notify me, Thank you!

Vartika · ‎03-30-2023

Hi @Alejandro Martinez

Hope all is well!

Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.

We'd love to hear from you.

Thanks!