Data Engineering

Databricks Jobs & Pipelines: Serverless SparkOutOfMemoryError while reading a 500 MB JSON file

LarsMewa
New Contributor III

I'm getting the following SparkOutOfMemoryError while reading a 500 MB JSON file; see below. I'm loading four CSV files (around 150 MB per file) and the JSON file in the same pipeline. When I load the JSON file alone it reads fine, and it also works when I load everything on a classic cluster.

Does anyone have an idea how to tweak serverless so it can read the JSON file while also processing the CSV files?
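For context, the pipeline looks roughly like this (a simplified sketch, assuming a declarative/DLT-style pipeline; the Volume paths and table names are made up, and the real pipeline loads four CSV sources plus the JSON file):

import dlt

# Sketch only: made-up paths, and one CSV source standing in for the four real ones.
@dlt.table
def orders_csv():
    return (spark.read
            .option("header", "true")
            .csv("/Volumes/main/raw/orders/*.csv"))

@dlt.table
def events_json():
    # The ~500 MB JSON file that triggers the Photon OOM on serverless.
    return spark.read.json("/Volumes/main/raw/events/events.json")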

Job aborted due to stage failure: Task 0 in stage 153.0 failed 4 times, most recent failure: Lost task 0.3 in stage 153.0 (TID 551) (10.46.122.241 executor 0): org.apache.spark.memory.SparkOutOfMemoryError: Photon ran out of memory while executing this query.
Photon failed to reserve 768.0 MiB for simdjson internal usage, in SimdJsonReader, in JsonFileScanNode(id=8883, output_schema=[string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, ... 3 more]), in task.
Memory usage:
Total task memory (including non-Photon): 1152.0 MiB
task: allocated 262.1 MiB, tracked 1152.0 MiB, untracked allocated 0.0 B, peak 1152.0 MiB
BufferPool: allocated 6.1 MiB, tracked 128.0 MiB, untracked allocated 0.0 B, peak 128.0 MiB
DataWriter: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
Photon Protobuf Plan Arena: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 110.8 KiB
JsonFileScanNode(id=8883, output_schema=[string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, ... 3 more]): allocated 256.0 MiB, tracked 1024.0 MiB, untracked allocated 0.0 B, peak 1024.0 MiB
JniReader: allocated 1984.0 B, tracked 1984.0 B, untracked allocated 0.0 B, peak 1984.0 B
SimdJsonReader: allocated 256.0 MiB, tracked 1024.0 MiB, untracked allocated 0.0 B, peak 1024.0 MiB
JSON buffer: allocated 256.0 MiB, tracked 256.0 MiB, untracked allocated 0.0 B, peak 256.0 MiB
simdjson internal usage: allocated 0.0 B, tracked 768.0 MiB, untracked allocated 0.0 B, peak 768.0 MiB
ProjectNode(id=8893, output_schema=[string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, ... 3 more]): allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
ProjectNode(id=8908, output_schema=[string, struct<string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, ... 4 more>]): allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
SortNode(id=8911, output_schema=[string, struct<string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, string, ... 4 more>]): allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
Sorter: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
spilled run buffers: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
output batch var len data: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
Memory consumers:
Acquired by com.databricks.photon.NativeMemoryConsumer@9cc6126: 1152.0 MiB

at 0xbca6493 <photon>.CreateReservationError(external/workspace_spark_3_5/photon/common/memory-tracker.cc:561)
at 0xbca51c7 <photon>.GrowBuffer(external/workspace_spark_3_5/photon/io/json/simd-json-reader.cc:295)
at 0x77b6d5f <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_5/photon/io/json/simd-json-reader.cc:313)
at 0x77b70e3 <photon>.HasNext(external/workspace_spark_3_5/photon/io/json/simd-json-reader.cc:365)
at 0x6e5444b <photon>.ReaderHasNext(external/workspace_spark_3_5/photon/exec-nodes/common-file-scan-node.h:139)
at 0x6e5405b <photon>.HasNextImpl(external/workspace_spark_3_5/photon/exec-nodes/json-file-scan-node.cc:121)
at 0x6d7c5e7 <photon>.OpenImpl(external/workspace_spark_3_5/photon/exec-nodes/sort-node.cc:140)
at com.databricks.photon.JniApiImpl.open(Native Method)
at com.databricks.photon.JniApi.open(JniApi.scala)
at com.databricks.photon.JniExecNode.open(JniExecNode.java:73)
at com.databricks.photon.PhotonColumnarBatchResultHandler.$anonfun$getResult$4(PhotonExec.scala:1224)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.photon.PhotonResultHandler.timeit(PhotonResultHandler.scala:30)
at com.databricks.photon.PhotonResultHandler.timeit$(PhotonResultHandler.scala:28)
at com.databricks.photon.PhotonColumnarBatchResultHandler.timeit(PhotonExec.scala:1216)
at com.databricks.photon.PhotonColumnarBatchResultHandler.getResult(PhotonExec.scala:1224)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.open(PhotonBasicEvaluatorFactory.scala:252)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.hasNextImpl(PhotonBasicEvaluatorFactory.scala:257)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.$anonfun$hasNext$1(PhotonBasicEvaluatorFactory.scala:275)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.photon.metrics.BillableTimeTaskMetrics.withPhotonBilling(BillableTimeTaskMetrics.scala:71)
at org.apache.spark.TaskContext.runFuncAsBillable(TaskContext.scala:267)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.hasNext(PhotonBasicEvaluatorFactory.scala:275)
at com.databricks.photon.CloseableIterator$$anon$10.hasNext(CloseableIterator.scala:211)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at org.apache.spark.sql.execution.aggregate.SortAggregateExec.$anonfun$doExecute$1(SortAggregateExec.scala:67)
at org.apache.spark.sql.execution.aggregate.SortAggregateExec.$anonfun$doExecute$1$adapted(SortAggregateExec.scala:64)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:932)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:932)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:420)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:417)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:384)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:420)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:417)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:384)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:83)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:39)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:227)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:204)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:166)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
at scala.util.Using$.resource(Using.scala:269)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:160)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:105)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$11(Executor.scala:1227)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:112)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:1231)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:1083)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)

4 REPLIES

jayanta1
New Contributor

Hi @LarsMewa ,

You could try increasing your driver and executor memory, and try defining the schema explicitly instead of inferring it.
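For example, supplying an explicit schema when reading the JSON could look like this (a minimal sketch; the field names and path are placeholders):

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema -- replace the fields with the actual columns of your JSON file.
json_schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("payload", StringType(), True),
])

# Passing the schema skips inference, so Spark does not need an extra pass over
# the 500 MB file just to work out the column types.
df = spark.read.schema(json_schema).json("/Volumes/main/raw/events/events.json")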

LarsMewa
New Contributor III

Hi @jayanta1 , 

can you guide me how to do that?

Saritha_S
Databricks Employee

Hi @LarsMewa 

When using serverless compute, you are not able to increase the size of the executors or the driver. Are you facing the issue only when processing the JSON and the CSV files together?

The error indicates a serverless resource constraint: serverless compute gives Photon a limited amount of off-heap memory. When you load both the CSV and JSON files, Photon has less room left to allocate its large JSON parsing buffer.

Kindly let me know if you have any questions on this.

LarsMewa
New Contributor III

This fixed it:

As a quick workaround to address out-of-memory errors when processing large JSON files in Databricks serverless pipelines, we recommend disabling the Photon JSON Scan. The Photon engine is optimized for performance, but scanning large JSON files with it can use up to 7x the raw file size in memory.

You can disable the Photon JSON scan by adding this configuration to your pipeline or notebook:

 
set spark.databricks.photon.jsonScan.enabled=false
 
This forces the engine to use the standard Spark JSON reader, which uses far less memory. There may be a slight performance trade-off, but it is the most reliable solution for large JSON datasets in serverless workflows.
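
For reference, one way to apply this from a notebook is via the Spark conf; for a pipeline, the same key/value pair can go into the pipeline's Configuration settings. A sketch, based on the config name above:

# Disable the Photon JSON scan so the standard Spark JSON reader is used instead.
spark.conf.set("spark.databricks.photon.jsonScan.enabled", "false")

# Equivalent SQL form:
# spark.sql("SET spark.databricks.photon.jsonScan.enabled = false")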
