Hi, I have a PySpark job that takes about an hour to complete. When I look at the SQL tab in the Spark UI, I see this:
Each of those entries runs for more than 1 minute out of the 60-minute job.
This is Ganglia for that period (the last snapshot; I will look into a live run for the last part):
From the Spark UI, I opened task 18 on the SQL tab, and this is what I see:
And these are the details (field and database names were replaced by placeholders or "..." for compliance purposes):
== Physical Plan ==
AppendDataExecV1 (1)
(1) AppendDataExecV1
Arguments: [num_affected_rows#1348L, num_inserted_rows#1349L], DeltaTableV2(org.apache.spark.sql.SparkSession@7ecdf898,dbfs:/mnt/eterlake/...../...,Some(CatalogTable(
Database: database
Table: table
Owner: (Basic token.....
Created Time: Sat Jul 13 16:06:20 UTC 2019
Last Access: UNKNOWN
Created By: Spark 2.4.0
Type: EXTERNAL
Provider: DELTA
Table Properties: [delta.lastCommitTimestamp=1662525805000, delta.lastUpdateVersion=8134, delta.minReaderVersion=1, delta.minWriterVersion=2]
Statistics: 0 bytes, 6260684735 rows
Location: dbfs:/mnt/.../location/...
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Schema: root
......
.....
.....
)),Some(spark_catalog.......),None,Map(),org.apache.spark.sql.util.CaseInsensitiveStringMap@1f), Project [... 26 more fields], org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$8007/1446072698@7a714f29, com.databricks.sql.transaction.tahoe.catalog.WriteIntoDeltaBuilder$$anon$1@1df0da7e
Do you see something that could be improved here?
Thanks!!!