Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to control garbage collection while using Autoloader File Notification?

New Contributor

I am using Autoloader to load files from a directory. I have set up File Notification with the Event Subscription.

I have the backfill interval set to 1 day and have not run the stream for a week, so there should only be about ~100 new files to pick up, and the stage shows as completed in the Spark UI.
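For context, the stream described above would look roughly like this. This is a minimal sketch of an Autoloader file-notification stream, not the poster's actual code; the file format, paths, schema location, checkpoint location, and trigger are placeholder assumptions:

```python
# Sketch of an Autoloader stream in file notification mode with a
# 1-day backfill interval. All paths and the format are hypothetical.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                       # placeholder format
    .option("cloudFiles.useNotifications", "true")             # file notification mode
    .option("cloudFiles.backfillInterval", "1 day")            # periodic backfill, as above
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
    .load("/mnt/source/path")
)

(
    df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/stream")   # placeholder path
    .start("/mnt/target/path")
)
```

This fragment only runs inside a Databricks/Spark session, so it is shown for orientation rather than as a runnable script.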

However, the job does not write anything; it stalls for a long time and never completes the write. In the Driver Logs, I see messages like this

2023-02-10T18:35:04.867+0000: [GC (Heap Inspection Initiated GC) [PSYoungGen: 2625154K->11041K(15486464K)] 2861020K->246915K(46883840K), 0.0116171 secs] [Times: user=0.09 sys=0.00, real=0.01 secs] 
2023-02-10T18:35:04.878+0000: [Full GC (Heap Inspection Initiated GC) [PSYoungGen: 11041K->0K(15486464K)] [ParOldGen: 235874K->231400K(31397376K)] 246915K->231400K(46883840K), [Metaspace: 291018K->291018K(313344K)], 0.1842356 secs] [Times: user=0.79 sys=0.00, real=0.18 secs] 

about every 20 mins.
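As a side note, log lines in this format can be scanned programmatically. The helper below is not from the thread; it is a small diagnostic sketch that extracts the collection type and pause duration from each GC record, so that long Full GC pauses stand out (the pauses in the lines above are only ~0.01–0.18 s):

```python
import re

# Match a GC record: "[GC ... 0.0116171 secs]" or "[Full GC ... 0.1842356 secs]".
# The non-greedy ".*?" stops at the first "secs]", which is the GC pause itself
# (the trailing "[Times: ...]" block comes later in the line).
GC_RE = re.compile(r"\[(Full GC|GC)\b.*?([\d.]+) secs\]")

def gc_pause(line):
    """Return (kind, pause_seconds) for a GC log line, or None if no match."""
    m = GC_RE.search(line)
    if m is None:
        return None
    return m.group(1), float(m.group(2))

# The two lines from the driver log above:
log = [
    "2023-02-10T18:35:04.867+0000: [GC (Heap Inspection Initiated GC) "
    "[PSYoungGen: 2625154K->11041K(15486464K)] 2861020K->246915K(46883840K), "
    "0.0116171 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]",
    "2023-02-10T18:35:04.878+0000: [Full GC (Heap Inspection Initiated GC) "
    "[PSYoungGen: 11041K->0K(15486464K)] [ParOldGen: 235874K->231400K(31397376K)] "
    "246915K->231400K(46883840K), [Metaspace: 291018K->291018K(313344K)], "
    "0.1842356 secs] [Times: user=0.79 sys=0.00, real=0.18 secs]",
]

for line in log:
    kind, secs = gc_pause(line)
    print(f"{kind}: {secs:.3f}s")
```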

The job has been stalled for hours; I have tried both increasing and decreasing the cluster size.

I do not want to have to reset the checkpoint and start over.



Not applicable

@nolanlavender008:

It looks like the job is experiencing frequent garbage collection (GC), which can cause significant delays and affect the job's performance. In this case, it seems like the issue may be related to the size of the heap, which is the portion of memory where the JVM stores objects.

To resolve this issue, you may try the following steps:

  1. Increase the heap size by adding the --conf spark.driver.memory=*** and --conf spark.driver.memoryOverhead=*** options to your spark-submit command, where *** is the amount of memory you want to allocate to the driver. For example, to set the driver memory to 16 GB with a 2 GB overhead: --conf spark.driver.memory=16g --conf spark.driver.memoryOverhead=2g
  2. If increasing the driver memory does not resolve the issue, try tuning the garbage collection settings. You can add the following options to your spark-submit command to enable verbose GC logging and switch to the G1GC collector (quote the values, since they contain spaces):
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseG1GC"
--conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseG1GC"

These options print detailed garbage-collection information to the driver and executor logs, which can help you identify the root cause of the issue.

  3. If neither of these options resolves the issue, you may need to optimize your code to reduce memory usage. This could involve restructuring the code to use more efficient algorithms, avoiding unnecessary caching, or using more memory-efficient data structures.
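Putting steps 1 and 2 together, a full spark-submit invocation would look like the sketch below. The job file name is a placeholder, and the memory values are the example figures from step 1; note the quotes around the extraJavaOptions values, which contain spaces:

```shell
spark-submit \
  --conf spark.driver.memory=16g \
  --conf spark.driver.memoryOverhead=2g \
  --conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseG1GC" \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseG1GC" \
  your_job.py   # placeholder job file
```

On Databricks you typically do not invoke spark-submit directly; the same key/value pairs go in the cluster's Spark config under Advanced options instead.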

It is important to note that resetting the checkpoint and starting over may be necessary in some cases, particularly if the job has been stalled for an extended period of time. However, before taking this step, it is worth exploring other options to see if the issue can be resolved without starting over.

Not applicable

Hi @nolanlavender008

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
