Corrupted Python installation on Python restart on DBR 13.3

ivanychev
Contributor

Hey there, we're using DBR 13.3 (no Docker) as a general-purpose cluster and initialize the cluster with the following init script:

```

#!/usr/bin/env bash
export DEBIAN_FRONTEND=noninteractive
set -euxo pipefail

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  echo "I am driver"
else
  echo "I am executor"
fi

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip -q awscliv2.zip
./aws/install
rm -rf awscliv2.zip aws

aws s3 cp "s3://constructor-analytics-data/deploy/dp_release/${MODE}/latest/dp_requirements.txt" /tmp/all_requirements.txt

/databricks/python/bin/pip install -U pip wheel
# Install from the file downloaded above (/tmp/all_requirements.txt)
/databricks/python/bin/pip install --no-cache-dir --no-deps -r /tmp/all_requirements.txt

```

In particular, the init script installs boto3==1.29.7 (not the boto3==1.24.28 that ships with the vanilla distribution, see https://docs.databricks.com/en/release-notes/runtime/13.3lts.html).
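Because the init script passes `--no-deps`, pip will not pull in a matching botocore on its own, so it may be worth confirming which versions the interpreter actually sees. The following is only a sketch: the helper name `installed_versions` is hypothetical, and it assumes the snippet runs in the same environment the init script installed into (for example, a notebook cell on the cluster).

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} as seen by the current interpreter."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Each boto3 release requires a specific botocore range, so the two
    # versions are worth looking at together.
    for name, version in installed_versions(["boto3", "botocore"]).items():
        print(f"{name}: {version}")
```

Running this once right after cluster start and once after an interpreter restart would show whether the reported versions themselves change, or only the files on disk.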

Whenever an OOM occurs on the Python side, the driver doesn't restart the node, but (apparently) the Python interpreter restarts.

After it restarts, boto3 stops working: any S3 operation ends with `An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied`. The reason is that the boto3 installation changes (screenshot).

Note that service-2.json was absent before the OOM but appeared after it. Its creation time is 6 minutes earlier than that of the nearby files, so I suspect this service-2.json was somehow taken from an older botocore. botocore uses this file to construct HTTP API requests from Python calls, and when this directory gets corrupted, boto3 stops working.
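One way to look for other files with the same symptom is to scan the botocore data directory for files that are noticeably older than their siblings. This is a rough sketch, not a definitive diagnostic: `files_older_than_siblings` and the 60-second tolerance are assumptions chosen for illustration, and the scan relies only on modification times.

```python
import os
from pathlib import Path


def files_older_than_siblings(directory, tolerance_seconds=60.0):
    """Return names of files whose mtime is older than the newest file in
    the same directory by more than tolerance_seconds."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    if not files:
        return []
    mtimes = {p: p.stat().st_mtime for p in files}
    newest = max(mtimes.values())
    return sorted(
        p.name for p, mtime in mtimes.items() if newest - mtime > tolerance_seconds
    )


if __name__ == "__main__":
    try:
        import botocore
    except ImportError:
        botocore = None  # nothing to scan outside the cluster environment

    if botocore is not None:
        data_root = Path(botocore.__file__).parent / "data"
        for dirpath, _dirnames, _filenames in os.walk(data_root):
            stale = files_older_than_siblings(dirpath)
            if stale:
                print(dirpath, stale)
```

If only service-2.json files show up, that would support the idea that they come from a different installation than the rest of the package.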

Why does this file suddenly appear in the botocore library files? Why didn't the other files change? What am I doing wrong here?

2 REPLIES

Kaniz
Community Manager

Hi @ivanychev, I apologize for any misunderstanding in my initial response. Thank you for clarifying the issue. Let’s focus on the specific problem you’re facing with boto3.

It appears that after the Python interpreter restarts due to an OOM event, the resources file within botocore changes unexpectedly. This behavior is causing the Access Denied error during S3 operations.

Here are some steps you can take to address this issue:

  1. Dependency Management:

    • Ensure that the installation of boto3==1.29.7 doesn’t introduce any conflicts or unexpected changes.
    • Consider reverting to the vanilla distribution’s boto3==1.24.28 to match the documented version.
  2. Isolate the Problem:

    • Set up a minimal environment (e.g., a fresh cluster) and reproduce the issue.
    • Monitor the behavior of the resources file during interpreter restarts.
  3. Logs and Diagnostics:

    • Examine logs for any additional error messages or warnings related to botocore.
    • Check if there are any other files or directories affected during the interpreter restart.
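The monitoring suggested in the steps above could be sketched as a snapshot-and-diff of the package directory, taken once before and once after an interpreter restart. This is a minimal sketch under stated assumptions: `snapshot` and `diff_snapshots` are hypothetical helper names, and where the snapshot is stored between runs is left to the reader.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file's path (relative to root) to its size, mtime and SHA-256."""
    root = Path(root)
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            stat = path.stat()
            result[path.relative_to(root).as_posix()] = {
                "size": stat.st_size,
                "mtime": stat.st_mtime,
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            }
    return result


def diff_snapshots(before, after):
    """Report files that were added, removed, or whose content changed."""
    common = set(before) & set(after)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(
            k for k in common if before[k]["sha256"] != after[k]["sha256"]
        ),
    }
```

For example, `snapshot(Path(botocore.__file__).parent)` could be dumped to JSON with `json.dump` right after cluster start and compared with a fresh snapshot after the restart. An unexpected service-2.json would then be listed under `added`.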

If you have any further details or logs, feel free to share them, and I’ll do my best to assist you. 🚀

Kaniz
Community Manager

Hi @ivanychev, let me get some of our experts here at Databricks to answer your question. Please bear with us until then.
