Hi,
We are trying to ingest zip files into a Delta Lake table on Azure Databricks using the COPY INTO command.
There are 100+ zip files, averaging ~300 MB each.
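For reference, this is roughly the shape of our ingestion call; the table name, storage path, and format options below are placeholders, not our exact production values:

# Sketch of the COPY INTO call (placeholder table/path/format options).
spark.sql("""
    COPY INTO my_delta_table
    FROM 'abfss://landing@mystorageaccount.dfs.core.windows.net/zips/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")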
Cluster configuration:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("YourApp")
    .config("spark.sql.execution.arrow.enabled", "true")            # Arrow-based Spark <-> pandas conversion
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "100")  # records per Arrow batch
    .config("spark.databricks.io.cache.maxFileSize", "2G")          # max file size eligible for the IO cache
    .config("spark.network.timeout", "1000s")
    .config("spark.driver.maxResultSize", "2G")
    .getOrCreate()
)
We are consistently getting the following error while trying to ingest the zip files:
Job aborted due to stage failure: Task 77 in stage 33.0 failed 4 times, most recent failure: Lost task 77.3 in stage 33.0 (TID 1667) (10.139.64.12 executor 20): ExecutorLostFailure (executor 20 exited caused by one of the running tasks) Reason: Command exited with code 50

The full error stack looks like this:

Py4JJavaError: An error occurred while calling o360.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 77 in stage 33.0 failed 4 times, most recent failure: Lost task 77.3 in stage 33.0 (TID 1667) (10.139.64.12 executor 20): ExecutorLostFailure (executor 20 exited caused by one of the running tasks) Reason: Command exited with code 50
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3628)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3559)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3546)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3546)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1521)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1521)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1521)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3875)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3787)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3775)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$runJob$1(DAGScheduler.scala:1245)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV
This works for a smaller number of zip files (up to 20). Even that did not work with the default cluster configuration; we had to increase the driver and worker sizes and raise the parallelism and executor memory options at the cluster level, as shown above. Now even this larger configuration fails when we try to ingest more zip files. We would rather not increase the cluster configuration any further, since that is not a sustainable solution and the number of files will keep growing.
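Since batches of up to 20 files do succeed, a manual batching loop like the sketch below would presumably work (table name and path are placeholders, and it assumes COPY INTO's FILES option), but it only papers over the scaling problem:

# Hypothetical batching workaround: run COPY INTO over ~20 files at a time.
# Table name, path, and file format are placeholders. COPY INTO skips files
# it has already loaded, so re-running a batch is safe.
source_path = "abfss://landing@mystorageaccount.dfs.core.windows.net/zips/"
file_names = [f.name for f in dbutils.fs.ls(source_path)]

batch_size = 20
for i in range(0, len(file_names), batch_size):
    batch = file_names[i:i + batch_size]
    files_clause = ", ".join(f"'{name}'" for name in batch)
    spark.sql(f"""
        COPY INTO my_delta_table
        FROM '{source_path}'
        FILEFORMAT = CSV
        FILES = ({files_clause})
    """)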
Please advise.
CC: @Anup