Databricks Community

nolanlavender00 · ‎02-10-2023

I am using Autoloader to load files from a directory. I have set up File Notification with the Event Subscription.

I have a backfill interval set to 1 day and have not run the stream for a week. There should only be about 100 new files to pick up, and the Spark UI shows the stage as completed.

However, the job stalls for a long time and never completes the write. In the driver logs, I see messages like this:

2023-02-10T18:35:04.867+0000: [GC (Heap Inspection Initiated GC) [PSYoungGen: 2625154K->11041K(15486464K)] 2861020K->246915K(46883840K), 0.0116171 secs] [Times: user=0.09 sys=0.00, real=0.01 secs] 
2023-02-10T18:35:04.878+0000: [Full GC (Heap Inspection Initiated GC) [PSYoungGen: 11041K->0K(15486464K)] [ParOldGen: 235874K->231400K(31397376K)] 246915K->231400K(46883840K), [Metaspace: 291018K->291018K(313344K)], 0.1842356 secs] [Times: user=0.79 sys=0.00, real=0.18 secs]

about every 20 mins.

The job has been stalled for hours. I have tried both increasing and decreasing the cluster size.

I do not want to have to reset the checkpoint and start over.

Thanks

Anonymous · ‎04-09-2023

@nolanlavender008 :

It looks like the job is experiencing frequent garbage collection (GC), which can cause significant delays and affect the job's performance. In this case, it seems like the issue may be related to the size of the heap, which is the portion of memory where the JVM stores objects.

To resolve this issue, you may try the following steps:

Increase the driver's memory by adding the --conf spark.driver.memory=*** and --conf spark.driver.memoryOverhead=*** options to your spark-submit command, where *** is the amount of memory you want to allocate. spark.driver.memory sets the size of the JVM heap, while spark.driver.memoryOverhead sets the additional non-heap memory reserved for the driver. For example, you could set the driver memory to 16 GB with an overhead of 2 GB by using the following options: --conf spark.driver.memory=16g --conf spark.driver.memoryOverhead=2g

If increasing the driver memory does not resolve the issue, you can try tuning the garbage collection settings. You can add the following options to your spark-submit command to enable verbose GC logging and set the GC algorithm to G1GC:

--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseG1GC"
--conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseG1GC"

These options will print detailed information about garbage collection to the console, which can help you identify the root cause of the issue.
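As a rough way to read this kind of output, the sketch below parses ParallelGC-style lines like the two quoted in the original post and reports the pause time and the heap occupancy after each collection. The names GcEvent, parse_gc_line and summarize are illustrative assumptions, and the regular expressions only cover the log format shown in the post, not G1 or unified (-Xlog) output.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class GcEvent(NamedTuple):
    kind: str            # "GC" or "Full GC"
    cause: str           # e.g. "Heap Inspection Initiated GC"
    heap_before_kb: int  # whole-heap usage before the collection
    heap_after_kb: int   # whole-heap usage after the collection
    heap_total_kb: int   # whole-heap capacity
    pause_secs: float    # wall-clock pause reported for the collection


# "[GC (cause)" or "[Full GC (cause)" at the start of the event.
_HEADER = re.compile(r"\[(Full GC|GC) \((.*?)\)")
# Whole-heap figures follow a closing "] " and carry no "Name:" label.
_HEAP = re.compile(r"\]\s(\d+)K->(\d+)K\((\d+)K\)")
# The pause is the first ", <seconds> secs]" entry (not the "real=" one).
_PAUSE = re.compile(r", ([\d.]+) secs\]")


def parse_gc_line(line: str) -> Optional[GcEvent]:
    """Parse one GC log line; return None if it does not match the format."""
    header = _HEADER.search(line)
    heap = _HEAP.search(line)
    pause = _PAUSE.search(line)
    if not (header and heap and pause):
        return None
    before, after, total = (int(g) for g in heap.groups())
    return GcEvent(header.group(1), header.group(2),
                   before, after, total, float(pause.group(1)))


def summarize(events: List[GcEvent]) -> Tuple[float, float]:
    """Return (total pause in seconds, highest post-GC heap occupancy 0..1)."""
    total_pause = sum(e.pause_secs for e in events)
    max_occupancy = max(e.heap_after_kb / e.heap_total_kb for e in events)
    return total_pause, max_occupancy


# The two log lines quoted in the original post.
SAMPLE_LINES = [
    "2023-02-10T18:35:04.867+0000: [GC (Heap Inspection Initiated GC) "
    "[PSYoungGen: 2625154K->11041K(15486464K)] 2861020K->246915K(46883840K), "
    "0.0116171 secs] [Times: user=0.09 sys=0.00, real=0.01 secs] ",
    "2023-02-10T18:35:04.878+0000: [Full GC (Heap Inspection Initiated GC) "
    "[PSYoungGen: 11041K->0K(15486464K)] [ParOldGen: 235874K->231400K(31397376K)] "
    "246915K->231400K(46883840K), [Metaspace: 291018K->291018K(313344K)], "
    "0.1842356 secs] [Times: user=0.79 sys=0.00, real=0.18 secs]",
]

if __name__ == "__main__":
    events = [e for e in map(parse_gc_line, SAMPLE_LINES) if e is not None]
    pause, occupancy = summarize(events)
    print(f"events={len(events)} total_pause={pause:.4f}s "
          f"max_post_gc_occupancy={occupancy:.2%}")
```

For the two lines in the post, the combined pause is about 0.2 seconds and the heap is under 1% occupied after collection, so it is worth running this over the full driver log to see whether longer pauses or higher occupancy show up elsewhere.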

If neither of these options resolves the issue, you may need to optimize your code to reduce memory usage. This could involve restructuring your code to use more efficient algorithms, avoiding unnecessary caching of data in memory, or using more efficient data structures.

It is important to note that resetting the checkpoint and starting over may be necessary in some cases, particularly if the job has been stalled for an extended period of time. However, before taking this step, it is worth exploring other options to see if the issue can be resolved without starting over.

Anonymous · ‎04-10-2023

Hi @nolanlavender008

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!

How to control garbage collection while using Autoloader File Notification?