07-26-2022 12:47 AM
tl;dr: A cell that runs purely on the head node stops printing output mid-execution, although output still shows up in the cluster logs. Once the cell finishes, Databricks never notices it is done and hangs. Cancelling hangs as well, so we have to "Clear state".
Long version:
We use the tsfresh library (https://github.com/blue-yonder/tsfresh) in Databricks on the head node only (no Spark - just Python). On most runs, the notebook cell's output simply stops while the cell is still executing. This means that in the notebook itself, no new output is shown, even though the cell keeps running in the background. We know this because files generated by the cell are still being written, and output keeps appearing under Cluster -> Driver Logs.
This in itself wouldn't really be a problem; however, Databricks never realizes the cell has finished, so the next cell never gets executed. The cell also cannot be cancelled the regular way - cancelling simply gets stuck - so we have to clear the state, which means losing all computation results that haven't been written out.
This happened with Runtime 7.3 LTS; we have since switched to 10.4 LTS and the problem still persists. We tried different head node sizes - sometimes it gets stuck sooner, sometimes later, the behavior isn't consistent. We suspect it has something to do with how tsfresh handles multiprocessing, but the problem seems to occur even with multiprocessing turned off.
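To illustrate what we suspect is going on (this is purely a guess at the mechanism, using only the standard library, not actual tsfresh code): if the notebook captures cell output by redirecting the driver process's sys.stdout, then prints made inside separate worker processes (like the ones tsfresh spawns for feature extraction) never pass through that capture and only reach the process's real stdout, i.e. the driver logs.

```python
# Hypothetical sketch of the suspected mechanism, standard library only.
import io
import multiprocessing
import sys

def worker():
    # This print happens in a separate process, so the parent's
    # redirected sys.stdout never sees it.
    print("hello from a worker process")

def captured_driver_output():
    buf = io.StringIO()
    original_stdout = sys.stdout
    sys.stdout = buf  # crude stand-in for a notebook's output capture
    try:
        p = multiprocessing.Process(target=worker)
        p.start()
        p.join()
        print("hello from the driver")  # this line is captured
    finally:
        sys.stdout = original_stdout
    return buf.getvalue()

if __name__ == "__main__":
    # Only the driver's own line ends up in the captured buffer;
    # the worker's line went to the process's real stdout instead.
    print(repr(captured_driver_output()))
```

This only explains the missing output, not the hang, but it matches what we see: the worker-generated output appears in the driver logs while the notebook shows nothing.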
In local Python notebooks this never happens, which leads us to believe it is a problem / bug in Databricks itself.
Any pointers on what we can try, or on how to get in contact with someone from Databricks to look into this?
07-26-2022 01:10 AM
I'd open a support ticket @ Databricks (probably has to go via your cloud provider).
07-26-2022 06:00 AM
Since that library works on pandas, the problem could be that it doesn't support the pandas API on Spark. In your local version you are probably using plain, non-distributed pandas. You can check the behavior by switching between:
import pandas as pd            # plain, single-machine pandas
import pyspark.pandas as pd    # pandas API on Spark (distributed)
07-27-2022 05:17 AM
Do you mean that it uses Spark even if I don't tell it to, somehow recognizing it on its own? Because I am not using Spark at all - I am using the exact same code as locally, and when I check the Spark jobs on the Databricks machine there are none (as I'd expect...).
I am using Databricks basically as a "local" machine that I can quickly deploy in the cloud, I am not intending to use any of the Spark / cluster functionality...
07-27-2022 05:23 AM
It won't use Spark unless you call Spark functions (a SparkContext will be created automatically, though).
Maybe you can try using the IPython kernel. As of Databricks Runtime 11.0 it is the default kernel for Python workloads, so I'd try that.
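If memory serves (please verify against the DBR 11.0 release notes - the exact config key is an assumption on my part), on runtimes where IPython isn't yet the default you could opt in via a cluster-level Spark config along these lines:

```
spark.databricks.python.defaultPythonRepl ipykernel
```

Set it under the cluster's Advanced Options -> Spark config and restart the cluster.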
08-01-2022 06:58 AM
Thanks! This actually seems to solve the problem, so I assume the IPython kernel did the trick. Do you know what was used instead in versions < 11.0? The docs don't seem to say...
08-01-2022 07:00 AM
I suppose it was something Databricks-specific; they probably created a custom kernel with similar properties.