
Notebook cell gets hung up but code completes

tim-mcwilliams
New Contributor III

I've been running into an issue when running a pymc-marketing model in a Databricks notebook. The cell that fits the model gets hung up and the progress bar stops moving; however, the code completes and dumps all needed output into a folder. After the code completes I then have to detach the notebook, since hitting Interrupt doesn't respond. I took a peek at the cluster logs and can confirm everything runs as expected (see screenshot!).

Any ideas what the issue is here, or have you run into the same issue?

10 REPLIES

tim-mcwilliams
New Contributor III

Hi @Retired_mod

Thank you for replying with some troubleshooting steps. Really appreciate it! I've added some more context below.

  • First, ensure that your Databricks cluster has sufficient resources (CPU, memory, and disk space) to handle the model training. If the cluster is under-provisioned, it could lead to hanging cells.
  • Monitor resource utilization during the model execution to see if any resource limits are being reached.
    • Referencing this and the above here: at first we saw that the model-fitting stage was using ~75% of the machine's CPU. We figured that was the issue, so we upgraded to a bigger machine (ML - Standard_D16as_v5 with 64 GB of memory and 16 cores). However, this did not solve the issue. I feel like the bigger machine should work, since it's more powerful than my local machine, where the same code and data run fine without errors.
  • Double-check your pymc-marketing model code. Are there any infinite loops or blocking operations that might cause the cell to hang?
    • Double-checked the code. There aren't any infinite loops or blocking operations in there.
  • Verify that there are no unintentional waits or sleeps in your code that could lead to the progress bar stalling.
    • No sleep or wait logic was added to the code.
  • You mentioned that hitting "Interrupt" doesn't respond. This behaviour could be related to the Databricks environment itself.
    • What about the Databricks env itself could be causing the behaviour?
  • Try using the "Restart" option for the notebook cell instead of "Interrupt" to see if it releases the hang.
    • Will have to try this out. Does restarting clear the notebook's memory?
  • Ensure that there are no dependencies between cells that could cause a deadlock. For example, if one cell relies on the output of another cell, it might lead to a hang.
  • Check if any other cells are running concurrently and potentially interfering with the current cell.
    • There are no concurrent cells running.
  • Add additional logging statements to your code to track its progress. This can help identify where the execution hangs.
  • Use print() statements or Databricks-specific logging functions to capture relevant information during execution.
    • Referencing this and the above here: we have logging and print statements throughout the code base. Everything works up until the progress bar starts; we have print and logging before model fitting. Then once the progress bar starts, the cell hangs. We even tried not displaying the progress bar, and the process still hangs and shows the same behaviour.
  • Review the configuration of your Databricks cluster. Consider adjusting parameters such as the number of worker nodes, driver memory, and executor memory.
    • Will have to try this, thanks! We haven't tried adjusting the worker nodes as of yet.
  • Experiment with different cluster configurations to see if the issue persists.
    • Will have to try this, thanks! 
  • Ensure that your pymc-marketing library and any other dependencies are up to date. Outdated libraries can sometimes cause unexpected behavior.
    • All libs used are the current versions.
  • If possible, try running the same code in a local Python environment to see if the issue persists outside of Databricks.
    • We have run the exact same code with the same data in our local envs and everything runs fine with no errors.
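Since the hang starts exactly at the progress bar, one low-effort way to localize it further is to wrap each stage of the pipeline in a timed log call, so the last "starting ..." line in the driver log shows precisely where things stalled. A minimal sketch; `mmm`, `X`, and `y` in the usage comment are placeholder names, not part of the original code:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("mmm-debug")

def log_stage(name, fn, *args, **kwargs):
    """Run one pipeline stage with start/finish log lines and wall-clock
    timing, so the driver log shows exactly which stage a hang began in."""
    log.info("starting %s", name)
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    log.info("finished %s in %.1fs", name, time.perf_counter() - t0)
    return result

# e.g.: trace = log_stage("model fit", mmm.fit, X, y)
```

If the log says "finished model fit" while the cell still looks stuck, that points at notebook output rendering rather than the sampler itself.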

Hey @tim-mcwilliams,

I've got exactly, and I mean exactly, the same problem. Have you found any solution?

Hey @Piotrus321 , 

I have not found any solution as of yet. I've been messing with cluster configs, but it seems to be a bigger problem here than compute power.

Piotrus321
New Contributor II

Hey @tim-mcwilliams 

I think I've found a solution that seems to work. It seems that the output displayed by pymc-marketing somehow crashes the Databricks cell. I disabled it by adding %%capture at the beginning of the cell and ; at the end of the cell.
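For anyone running the same code from a plain script, where the %%capture cell magic isn't available, the standard library offers a rough equivalent. A sketch; the `model.fit(...)` comment marks where the real (placeholder-named) fitting call would go:

```python
import io
from contextlib import redirect_stdout, redirect_stderr

# Capture everything the wrapped code writes to Python-level stdout/stderr,
# roughly what %%capture does for a whole notebook cell.
buf = io.StringIO()
with redirect_stdout(buf), redirect_stderr(buf):
    print("progress output that would normally flood the cell")
    # model.fit(...) would go here -- `model` is a placeholder name

# The captured text is available afterwards (or can simply be discarded):
captured = buf.getvalue()
```

The trailing ; mentioned above plays a separate role in a notebook: it suppresses the cell's final result repr, which redirecting streams does not cover.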

Hey @Piotrus321 

Good find! I gave that a try, but unfortunately I'm getting the same behavior. I added %%capture at the beginning of the cell that runs the model-fitting code and ; at the end. The cell ran for about an hour and a half while I was doing some other work. I came back to it and canceled the cell, but it still hung up on me.

My data isn't big, about 4 months' worth with about 6 variables. The same model runs in about 1.5 mins on my local machine.

Mickel
New Contributor II

This can be a frustrating situation where the notebook cell appears stuck, but the code execution actually finishes in the background. Here are some steps you can take to troubleshoot this:

1. Restart vs Interrupt:

  • Try using the "Restart" option for the cell instead of "Interrupt." Interrupting might not always gracefully stop the code execution, leading to a hanging progress bar.

2. Check for Deadlocks:

  • Ensure there are no dependencies between cells that could cause a deadlock. If a cell relies on the output of another cell that's stuck, it might appear frozen. Run cells independently or restructure your code to avoid dependencies.

3. Identify Long-Running Processes:

  • If your code involves intensive computations or external API calls, it might take longer than expected. Consider adding progress bars or print statements within the code to track progress and identify potential bottlenecks.

4. Resource Constraints:

  • In environments like Databricks notebooks, ensure your cluster has sufficient resources (CPU, memory) to handle the workload. An under-provisioned cluster can lead to slow execution and a hung appearance.

5. Concurrency and Parallelism:

  • Some notebooks allow parallel execution of cells. If other cells are running concurrently, they might compete for resources and slow down the current cell's progress bar update. Try running the problematic cell in isolation.

6. Logging and Debugging:

  • Add logging statements within your code to track its execution flow. This can help pinpoint where the actual hang-up is occurring.

7. Update Libraries and Restart Kernel:

  • Outdated libraries or a corrupted kernel might cause unexpected behavior. Try updating any relevant libraries and restarting the notebook kernel to see if the issue persists.

8. Consider Alternatives:

  • If the problem persists, consider breaking down your code into smaller, more manageable chunks. This can improve performance and make it easier to identify bottlenecks.
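Point 3 above can be made concrete with a small heartbeat thread that prints elapsed time at a fixed interval, which helps tell a stalled progress bar apart from a stalled kernel. A sketch, not Databricks-specific; `model.fit` in the usage comment is a placeholder:

```python
import threading
import time

def heartbeat(interval_s=30.0):
    """Start a daemon thread that prints elapsed wall-clock time every
    `interval_s` seconds. Returns an Event; call .set() to stop it."""
    stop = threading.Event()
    t0 = time.monotonic()

    def _beat():
        # Event.wait returns False on timeout (beat again) and True once set.
        while not stop.wait(interval_s):
            print(f"[heartbeat] still alive, {time.monotonic() - t0:.0f}s elapsed",
                  flush=True)

    threading.Thread(target=_beat, daemon=True).start()
    return stop

# Usage around a long-running fit:
#   stop = heartbeat(30)
#   model.fit(data)   # placeholder call
#   stop.set()
```

If the heartbeat keeps printing while the library's own progress bar is frozen, the kernel is alive and the problem is in output rendering, not execution.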

haseebasif
New Contributor II

hi @tim-mcwilliams, Did you manage to fix the issue or identify the root cause?

It would be really helpful to know. Thanks a lot. 

g000gl
New Contributor II

hi @tim-mcwilliams 

Did you manage to fix the issue or identify the root cause?

Metodi_Simeonov
New Contributor II

Hi @tim-mcwilliams, @haseebasif, @Mickel, @g000gl, @Piotrus321,

I encountered a similar problem with a Prophet Markov chain Monte Carlo (MCMC) model that completely drained my browser's CPU and RAM. Even after the workflow completed, attempting to open either the notebook or the script used to run the code resulted in the same issue.

I suspect the behaviour you are experiencing is similar to mine and is related to the large amount of log output generated during the Bayesian inference process.

In my case, this behaviour was caused by the cmdstanpy library, which is what runs the MCMC simulation, and it likely stems from the underlying compiled C++ code. Therefore, using the logging library to set different levels for the Python packages does not work.

Since the output is printed out directly to the stdout and stderr streams by the underlying Stan processes, regardless of whether you are running the code from a notebook or a script file, you can just redirect both stdout and stderr streams while running the code with a custom contextmanager class.

 
import os
import sys
from contextlib import contextmanager

@contextmanager
def suppress_stdout_stderr():
    with open(os.devnull, 'w') as devnull:
        old_stdout = sys.stdout
        old_stderr = sys.stderr
        try:
            sys.stdout = devnull
            sys.stderr = devnull
            yield
        finally:
            sys.stdout = old_stdout
            sys.stderr = old_stderr

Then simply wrap your training/fitting code with the context manager:

with suppress_stdout_stderr():
    model = your_model_class()
    model.fit(your_data)

This way the context manager opens a null file for writing the logs and then discards all data written to it.
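One caveat worth flagging: swapping sys.stdout and sys.stderr only silences Python-level writes. If the noise truly comes from compiled C++ code (as described above for the Stan processes), it may be written straight to file descriptors 1 and 2 and bypass sys.stdout entirely. A descriptor-level redirect covers that case too; this is a sketch of the general technique, not something from the original post:

```python
import os
import sys
from contextlib import contextmanager

@contextmanager
def suppress_fd_output():
    """Redirect the process-level stdout/stderr *file descriptors* (1 and 2)
    to /dev/null. Unlike swapping sys.stdout, this also silences output
    written directly by compiled C/C++ extensions."""
    sys.stdout.flush()
    sys.stderr.flush()
    devnull = os.open(os.devnull, os.O_WRONLY)
    saved_out, saved_err = os.dup(1), os.dup(2)
    try:
        os.dup2(devnull, 1)
        os.dup2(devnull, 2)
        yield
    finally:
        sys.stdout.flush()
        sys.stderr.flush()
        os.dup2(saved_out, 1)
        os.dup2(saved_err, 2)
        for fd in (devnull, saved_out, saved_err):
            os.close(fd)
```

It is used exactly like suppress_stdout_stderr above, and the two can be nested if you want both layers covered.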

Lingesh
Databricks Employee

@tim-mcwilliams I'm not sure if you found a workaround or a fix for this issue.

We have recently found another issue (the integration between PyMC and the Databricks kernel does not go well; specifically, the rendering logic of the progress bar in PyMC) that I think is similar and related to the issue described. While we have yet to root out the cause, we have found that disabling the progress bar (progressbar=False) helped to keep the notebook cell alive/responsive. You could give disabling the progress bar a try.

Here is the workaround we used for the other use case involving PyMC:

trace_04 = pm.sample(nuts={'target_accept':0.8}, var_names=["alpha", "delta", "mu", "sig"], progressbar=False)
