Tuesday
I'm getting the following SparkOutOfMemoryError while reading a 500 MB JSON file, see below. I'm loading four CSV files (around 150 MB each) and the JSON file in the same pipeline. When I load the JSON file alone it reads fine, and the same when I load everything on a classic (non-serverless) cluster.
Does anyone have an idea how to tweak serverless so it can read the JSON file while also processing the CSV files?
Job aborted due to stage failure: Task 0 in stage 153.0 failed 4 times, most recent failure: Lost task 0.3 in stage 153.0 (TID 551) (10.46.122.241 executor 0): org.apache.spark.memory.SparkOutOfMemoryError: Photon ran out of memory while executing this query.
Photon failed to reserve 768.0 MiB for simdjson internal usage, in SimdJsonReader, in JsonFileScanNode(id=8883, output_schema=[string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, ... 3 more]), in task.
Memory usage:
Total task memory (including non-Photon): 1152.0 MiB
task: allocated 262.1 MiB, tracked 1152.0 MiB, untracked allocated 0.0 B, peak 1152.0 MiB
BufferPool: allocated 6.1 MiB, tracked 128.0 MiB, untracked allocated 0.0 B, peak 128.0 MiB
DataWriter: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
Photon Protobuf Plan Arena: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 110.8 KiB
JsonFileScanNode(id=8883, output_schema=[string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, ... 3 more]): allocated 256.0 MiB, tracked 1024.0 MiB, untracked allocated 0.0 B, peak 1024.0 MiB
JniReader: allocated 1984.0 B, tracked 1984.0 B, untracked allocated 0.0 B, peak 1984.0 B
SimdJsonReader: allocated 256.0 MiB, tracked 1024.0 MiB, untracked allocated 0.0 B, peak 1024.0 MiB
JSON buffer: allocated 256.0 MiB, tracked 256.0 MiB, untracked allocated 0.0 B, peak 256.0 MiB
simdjson internal usage: allocated 0.0 B, tracked 768.0 MiB, untracked allocated 0.0 B, peak 768.0 MiB
ProjectNode(id=8893, output_schema=[string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, ... 3 more]): allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
ProjectNode(id=8908, output_schema=[string, struct<string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, ... 4 more>]): allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
SortNode(id=8911, output_schema=[string, struct<string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, ... 4 more>]): allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
Sorter: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
spilled run buffers: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
output batch var len data: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
Memory consumers:
Acquired by com.databricks.photon.NativeMemoryConsumer@9cc6126: 1152.0 MiB
at 0xbca6493 <photon>.CreateReservationError(external/workspace_spark_3_5/photon/common/memory-tracker.cc:561)
at 0xbca51c7 <photon>.GrowBuffer(external/workspace_spark_3_5/photon/io/json/simd-json-reader.cc:295)
at 0x77b6d5f <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_5/photon/io/json/simd-json-reader.cc:313)
at 0x77b70e3 <photon>.HasNext(external/workspace_spark_3_5/photon/io/json/simd-json-reader.cc:365)
at 0x6e5444b <photon>.ReaderHasNext(external/workspace_spark_3_5/photon/exec-nodes/common-file-scan-node.h:139)
at 0x6e5405b <photon>.HasNextImpl(external/workspace_spark_3_5/photon/exec-nodes/json-file-scan-node.cc:121)
at 0x6d7c5e7 <photon>.OpenImpl(external/workspace_spark_3_5/photon/exec-nodes/sort-node.cc:140)
at com.databricks.photon.JniApiImpl.open(Native Method)
at com.databricks.photon.JniApi.open(JniApi.scala)
at com.databricks.photon.JniExecNode.open(JniExecNode.java:73)
at com.databricks.photon.PhotonColumnarBatchResultHandler.$anonfun$getResult$4(PhotonExec.scala:1224)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.photon.PhotonResultHandler.timeit(PhotonResultHandler.scala:30)
at com.databricks.photon.PhotonResultHandler.timeit$(PhotonResultHandler.scala:28)
at com.databricks.photon.PhotonColumnarBatchResultHandler.timeit(PhotonExec.scala:1216)
at com.databricks.photon.PhotonColumnarBatchResultHandler.getResult(PhotonExec.scala:1224)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.open(PhotonBasicEvaluatorFactory.scala:252)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.hasNextImpl(PhotonBasicEvaluatorFactory.scala:257)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.$anonfun$hasNext$1(PhotonBasicEvaluatorFactory.scala:275)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.photon.metrics.BillableTimeTaskMetrics.withPhotonBilling(BillableTimeTaskMetrics.scala:71)
at org.apache.spark.TaskContext.runFuncAsBillable(TaskContext.scala:267)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.hasNext(PhotonBasicEvaluatorFactory.scala:275)
at com.databricks.photon.CloseableIterator$$anon$10.hasNext(CloseableIterator.scala:211)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at org.apache.spark.sql.execution.aggregate.SortAggregateExec.$anonfun$doExecute$1(SortAggregateExec.scala:67)
at org.apache.spark.sql.execution.aggregate.SortAggregateExec.$anonfun$doExecute$1$adapted(SortAggregateExec.scala:64)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:932)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:932)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:420)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:417)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:384)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:420)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:417)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:384)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:83)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:39)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:227)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:204)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:166)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
at scala.util.Using$.resource(Using.scala:269)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:160)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:105)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$11(Executor.scala:1227)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:112)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:1231)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:1083)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Driver stacktrace:
Tuesday
Hi @LarsMewa ,
Could you try increasing your driver and executor memory, and try defining the schema explicitly instead of inferring it?
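For the schema part, here is a minimal sketch of passing an explicit schema to the JSON reader instead of relying on inference. The field names and file path are placeholders, not the poster's actual data; in a Databricks notebook the `spark` session is already defined.

```python
# Minimal sketch (not the original pipeline): read the JSON file with an
# explicit schema so Spark skips the extra schema-inference pass.
# Field names and the path below are placeholders.
from pyspark.sql.types import StructType, StructField, StringType

json_schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("payload", StringType(), True),
    # ...add the remaining fields of your JSON documents here
])

df = (
    spark.read
         .schema(json_schema)   # explicit schema, no inference scan
         .json("/Volumes/my_catalog/my_schema/raw/my_file.json")
)
```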
Tuesday
Hi @jayanta1 ,
Can you guide me on how to do that?
Tuesday
Hi @LarsMewa
When using serverless, we are not able to increase the size of the executors or the driver. Are you facing the issue only when processing the JSON and the CSV files together?
The error indicates a serverless resource constraint: serverless compute uses limited off-heap memory for Photon, so when you load both the CSV and the JSON files, Photon has less room left to allocate its large parsing buffer.
Kindly let me know if you have any questions on this.
Wednesday
This fixed it:
As a quick workaround for out-of-memory errors when processing large JSON files in Databricks serverless pipelines, we recommend disabling the Photon JSON scan. The Photon engine is optimized for performance, but scanning large JSON files with it can use up to 7x the raw file size in memory.
Try disabling the Photon JSON scan by adding this configuration to your pipeline or notebook:
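(The configuration value itself was not preserved in this thread. Purely as an illustration of where such a setting would go in a notebook, a Spark config is applied as below; the key is a placeholder, not a verified flag name, so substitute the exact setting recommended by Databricks.)

```python
# Illustration only: applying a Spark configuration in a Databricks notebook.
# "<photon-json-scan-flag>" is a placeholder; the actual flag name was not
# captured in this thread, so use the setting Databricks recommends.
spark.conf.set("<photon-json-scan-flag>", "false")
```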