
Error ingesting zip files: ExecutorLostFailure Reason: Command exited with code 50

nikhilmb
New Contributor II

Hi,

We are trying to ingest zip files into an Azure Databricks Delta Lake using the COPY INTO command.

There are 100+ zip files with an average size of ~300 MB each.
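
For context, the statement we run is essentially the sketch below (the table name and storage path are placeholders, not our real ones):

```python
# Minimal sketch of the COPY INTO call issued from the notebook.
# "raw.landing.zip_files" and the abfss path are hypothetical placeholders.
spark.sql("""
    COPY INTO raw.landing.zip_files
    FROM 'abfss://landing@ourstorageaccount.dfs.core.windows.net/zips/'
    FILEFORMAT = BINARYFILE
    PATTERN = '*.zip'
""")
```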

Cluster configuration:

  • 1 driver: 56GB, 16 cores
  • 2-8 workers: 32GB, 8 cores (each). Autoscaling enabled.

The following Spark parameters are set at the cluster level:

  • spark.default.parallelism 150
  • spark.executor.memory 30g

The following Spark parameters are set at the notebook level (while running the COPY INTO command):

spark = (
    SparkSession.builder.appName("YourApp")
    .config("spark.sql.execution.arrow.enabled", "true")
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "100")
    .config("spark.databricks.io.cache.maxFileSize", "2G")
    .config("spark.network.timeout", "1000s")
    .config("spark.driver.maxResultSize", "2G")
    .getOrCreate()
)

We are consistently getting the following error while trying to ingest the zip files:

Job aborted due to stage failure: Task 77 in stage 33.0 failed 4 times, most recent failure: Lost task 77.3 in stage 33.0 (TID 1667) (10.139.64.12 executor 20): ExecutorLostFailure (executor 20 exited caused by one of the running tasks) Reason: Command exited with code 50

The error stack looks like this:

Py4JJavaError: An error occurred while calling o360.sql. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 77 in stage 33.0 failed 4 times, most recent failure: Lost task 77.3 in stage 33.0 (TID 1667) (10.139.64.12 executor 20): ExecutorLostFailure (executor 20 exited caused by one of the running tasks) Reason: Command exited with code 50

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3628)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3559)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3546)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3546)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1521)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1521)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1521)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3875)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3787)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3775)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$runJob$1(DAGScheduler.scala:1245)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV

This works for a smaller number of zip files (up to 20). Even that did not work with the default cluster configuration; we had to increase the driver and worker sizes and raise the parallelism and executor memory options at the cluster level, as mentioned above. Now this higher configuration also fails when we try to ingest more zip files. We would rather not increase the cluster configuration any further, as that is not an optimal solution and the number of files can keep growing.

Please advise.

CC: @Anup

3 REPLIES

nikhilmb
New Contributor II

Thanks for the response.

We tried all the suggestions in the post. It's still failing.

I think Spark tries to unzip the files during ingestion and that is where it runs out of memory. Maybe ingesting zip files is not supported yet. We are now exploring the Unity Catalog volume option to ingest the zip files and access them from the Delta Lake, roughly as sketched below.
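
For anyone following along, this is the kind of thing we are testing (the volume and storage names are placeholders, not our actual ones):

```python
# Sketch only: copy zip files from cloud storage into a Unity Catalog volume.
# "main.raw.landing_zips" and the abfss path are hypothetical placeholders.
src = "abfss://landing@ourstorageaccount.dfs.core.windows.net/zips/"
dst = "/Volumes/main/raw/landing_zips/"

# dbutils.fs.cp copies files; recurse=True walks the whole source directory.
dbutils.fs.cp(src, dst, recurse=True)
```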

nikhilmb
New Contributor II

Just in the hope that this might benefit other users: we decided to go with the good old way of mounting the cloud object store onto DBFS and then ingesting the data from the mounted drive into a Unity Catalog-managed volume (a rough sketch follows). We tried this for the 500+ zip files and it is working as expected.
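
A rough outline of what that looks like for us, assuming a service principal for the mount; the container, storage account, secret scope, and volume names below are placeholders:

```python
# Sketch only: mount an ADLS Gen2 container on DBFS, then copy into a UC volume.
# All names, scopes, and paths are hypothetical placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the container (skip if it is already mounted).
dbutils.fs.mount(
    source="abfss://landing@ourstorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/landing",
    extra_configs=configs,
)

# Copy the zip files from the mount into the Unity Catalog-managed volume.
dbutils.fs.cp("/mnt/landing/zips/", "/Volumes/main/raw/landing_zips/", recurse=True)
```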

nikhilmb
New Contributor II

Although we were able to copy the zip files onto the Databricks volume, we were not able to share them with any system outside of the Databricks environment. I guess Delta Sharing does not support sharing files that are on UC volumes.
