Issue with VSCode Extension and Databricks Cluster Using Docker Image

danmlopsmaz
New Contributor II

I've encountered a significant issue while using the VSCode extension for Databricks, particularly when working with a cluster configured with a Docker image. Here's a detailed description of the problem:

Problem Description

When attempting to upload and execute a Python file from VSCode on a Databricks cluster that uses a custom Docker image, the connection fails and the extension does not function as expected.
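For context, the file being run is an ordinary PySpark script; a minimal example along the following lines (the contents here are illustrative, not my exact script) is the kind of thing I upload and run:

from pyspark.sql import SparkSession

# The "upload and run file" command executes this file on the cluster, where a
# Spark session is already available; getOrCreate() attaches to that session.
spark = SparkSession.builder.getOrCreate()

df = spark.range(5)
print(df.count())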


==============================
Errors in 00-databricks-init-3331c3ed293013bfec5837e683d00cfe.py:
 
WARNING - All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1721481367.546267  105941 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache
Error: CommandExecution.createAndWait: failed to reach Running state, got Error: [object Object]

 


7/20/2024, 8:34:11 AM - Creating execution context on cluster 0719 ...
Error: CommandExecution.createAndWait: failed to reach Running state, got Error: [object Object]
Execution terminated

 

Potential Workarounds

  • Databricks Connect: running Databricks Connect from a terminal works and executes the Spark code on the cluster, but the VS Code extension still does not (see the sketch below).
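A rough sketch of that workaround, assuming Databricks Connect for DBR 13+ is installed locally and a Databricks CLI profile is configured (the profile name below is a placeholder):

from databricks.connect import DatabricksSession

# Build a remote Spark session against the cluster referenced by the
# (placeholder) "DEFAULT" CLI profile; the Spark code then runs on the cluster.
spark = DatabricksSession.builder.profile("DEFAULT").getOrCreate()

print(spark.range(10).count())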

Note

It is important to mention that when I run the same Python file on a standard cluster without Docker, the VSCode extension works as expected.

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @danmlopsmaz, Could you please ensure that the custom Docker image you are using is compatible with the Databricks runtime? Sometimes, discrepancies between the image and the runtime can lead to failures in command execution.
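As a quick, purely illustrative sanity check, you could run a small script on the Docker-image cluster and compare what it reports against the target Databricks Runtime's documented system environment:

import platform
import sys

# Report the Python environment the cluster is actually using; a mismatch with the
# Python version expected by the Databricks Runtime is a common cause of failures
# with custom Docker images.
print("Python:", sys.version)
print("Executable:", sys.executable)
print("Platform:", platform.platform())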

danmlopsmaz
New Contributor II

Hi @Kaniz_Fatma, thanks for such a quick response.

Actually, I am using the Dockerfile from the Databricks runtime example here: https://github.com/databricks/containers/blob/master/ubuntu/minimal/Dockerfile . The configuration of the VSCode extension is fine since, as I mentioned, the "upload and run python file" command works with a standard cluster.

This is my Dockerfile:

# This Dockerfile creates a clean Databricks runtime 12.2 LTS without any library ready to deploy to Databricks
FROM databricksruntime/minimal:14.3-LTS
# These are the versions compatible for DBR 12.x

ARG python_version="3.9"
ARG pip_version="22.3.1"
ARG setuptools_version="65.6.3"
ARG wheel_version="0.38.4"
ARG virtualenv_version="20.16.7"

# Set the debconf frontend to Noninteractive
RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

# Installs python 3.x and virtualenv for Spark and Notebooks
RUN sudo apt-get update && sudo apt-get install dialog apt-utils curl build-essential fuse openssh-server software-properties-common --yes \
    && sudo add-apt-repository ppa:deadsnakes/ppa -y && sudo apt-get update \
    && sudo apt-get install python${python_version} python${python_version}-dev python${python_version}-distutils --yes \
    && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py \
    && /usr/bin/python${python_version} get-pip.py "pip>=${pip_version}" "setuptools>=${setuptools_version}" "wheel>=${wheel_version}" \
    && rm get-pip.py \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

RUN /usr/local/bin/pip${python_version} install --no-cache-dir virtualenv==${virtualenv_version} \
    && sed -i -r 's/^(PERIODIC_UPDATE_ON_BY_DEFAULT) = True$/\1 = False/' /usr/local/lib/python${python_version}/dist-packages/virtualenv/seed/embed/base_embed.py \
    && /usr/local/bin/pip${python_version} download pip==${pip_version} --dest \
    /usr/local/lib/python${python_version}/dist-packages/virtualenv_support/

# Initialize the default environment that Spark and notebooks will use
RUN virtualenv --python=python${python_version} --system-site-packages /databricks/python3 --no-download  --no-setuptools

# These python libraries are used by Databricks notebooks and the Python REPL
# You do not need to install pyspark - it is injected when the cluster is launched
# Versions are intended to reflect latest DBR: https://docs.databricks.com/release-notes/runtime/11.1.html#system-environment
RUN /databricks/python3/bin/pip install \
    "six>=1.16.0" \
    "jedi>=0.18.1" \
    # ensure minimum ipython version for Python autocomplete with jedi 0.17.x
    "ipython>=8.10.0" \
    "pyarrow>=8.0.0" \
    "ipykernel>=6.17.1" \
    "grpcio>=1.48.1" \
    "grpcio-status>=1.48.1" \
    "databricks-sdk>=0.1.6"

# Specifies where Spark will look for the python process
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3
# Specifies Tracking URI for MLflow Integration
ENV MLFLOW_TRACKING_URI='databricks'
# Make sure the USER env variable is set. The files exposed
# by dbfs-fuse will be owned by this user.
# Within the container, the USER is always root.
ENV USER root

 
