
Inconsistent behavior while loading pickle file

mh-hsn
New Contributor II

I have a pickle file "vectorizer.pkl" and I am seeing inconsistent behavior when trying to load it: sometimes it loads successfully, and sometimes I get an error. Here is how I load the file:

import os
from joblib import load

tmp_path = client.download_artifacts(run_id=run_id, path='')
vectorizer = load(os.path.join(tmp_path, 'vectorizer.pickle'))

The error that I get is:

ConnectException error
This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.

There are two things to note about the above issue:

  • The pickle file is only 7.5 MB.
  • No other process is running on the cluster.
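To illustrate why the file size alone should not be the problem, here is a minimal, self-contained sketch (using a hypothetical stand-in object rather than my real vectorizer) that writes a small joblib pickle and checks its on-disk size before loading:

```python
import os
import tempfile

from joblib import dump, load

# Hypothetical stand-in for the real vectorizer: any small picklable object.
tmp_dir = tempfile.mkdtemp()
pkl_path = os.path.join(tmp_dir, "vectorizer.pickle")
dump({"vocab": ["a", "b", "c"]}, pkl_path)

# Check the on-disk size before loading; a file of only a few MB
# should not, by itself, exhaust driver memory.
size_mb = os.path.getsize(pkl_path) / (1024 * 1024)
obj = load(pkl_path)
```

A 7.5 MB file loaded this way uses a small multiple of its size in memory, nowhere near enough to OOM a Standard_D64s_v3 driver.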

I have experienced the same inconsistent behavior on multiple clusters. Here are the specs of two of them:

My old cluster (my main cluster):

9.1-LTS ML (includes Apache Spark 3.1.2, Scala 2.12)
Worker type: Standard_D16s_v3 (min 1, max 8)
Driver type: Standard_D64s_v3
Spot instances = True

My new cluster (created just to reproduce the error):

9.1-LTS ML (includes Apache Spark 3.1.2, Scala 2.12)
Worker type: Standard_DS3_v2 (min 1, max 8)
Driver type: Standard_DS3_v2
Spot instances = True

A bit more information about the experiment I performed after creating the new cluster. The first time I tried to load the pickle file on it, I got the following error:

joblib.load RecursionError: maximum recursion depth exceeded while calling a Python object

When I searched for this error, I came across a few threads suggesting increasing the recursion limit, so I added the following two lines to my code:

import sys
sys.setrecursionlimit(30000)

After adding those two lines, I got the same error as on my main cluster:

ConnectException error
This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
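As an aside, if raising the recursion limit does turn out to help, a slightly safer pattern (a sketch, not specific to Databricks) is to scope the change so the higher limit does not leak into the rest of the notebook:

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    # Temporarily raise the recursion limit and restore the original
    # value afterwards, even if the load raises.
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(max(limit, old))
    try:
        yield
    finally:
        sys.setrecursionlimit(old)


with recursion_limit(30000):
    # e.g. vectorizer = load(os.path.join(tmp_path, 'vectorizer.pickle'))
    pass
```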

The next day, when I executed the same code (without the two newly added lines) on my new cluster, it ran just fine, i.e. it was able to load the pickle file.

I am currently experiencing the same inconsistent behavior on both clusters. On my main cluster, my parent notebook calls a child notebook twice, and the child notebook loads the pickle file. In a failed workflow run, the first call loaded the file just fine, but the second call, later in the parent notebook, ran into the error.

0 REPLIES