Issue with VSCode Extension and Databricks Cluster Using Docker Image
07-20-2024 06:36 AM - edited 07-20-2024 06:51 AM
I've encountered a significant issue while using the VSCode extension for Databricks, particularly when working with a cluster configured with a Docker image. Here's a detailed description of the problem:
Problem Description
When attempting to upload and run a Python file from VSCode on a Databricks cluster that uses a custom Docker image, the connection fails and the extension does not work as expected.
==============================
Errors in 00-databricks-init-3331c3ed293013bfec5837e683d00cfe.py:
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1721481367.546267  105941 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache
Error: CommandExecution.createAndWait: failed to reach Running state, got Error: [object Object]
7/20/2024, 8:34:11 AM - Creating execution context on cluster 0719 ...
Error: CommandExecution.createAndWait: failed to reach Running state, got Error: [object Object]
Execution terminated
Potential Workarounds
Databricks Connect: running Databricks Connect from a terminal does execute the Spark code on the cluster, but the VSCode extension does not.
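As an illustration of that workaround, a minimal Databricks Connect snippet along these lines runs fine from a terminal; the workspace URL, token, and cluster ID are placeholders, not my real values:

# Minimal sketch of the Databricks Connect workaround.
# Requires the databricks-connect package matching the cluster's DBR version.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host="https://<workspace-url>",      # placeholder workspace URL
    token="<personal-access-token>",     # placeholder PAT
    cluster_id="<cluster-id>",           # placeholder cluster ID
).getOrCreate()

# Trivial job to confirm the session reaches the Docker-image cluster.
print(spark.range(10).count())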
Note
It is worth mentioning that when I run the same Python file on a standard cluster without a Docker image, the VSCode extension works as expected.
07-22-2024 01:41 PM
Hi @Retired_mod, thanks for such a quick response.
Actually, I am using the Dockerfile from the Databricks runtime example here: https://github.com/databricks/containers/blob/master/ubuntu/minimal/Dockerfile. The configuration of the VSCode extension is fine, since, as I already mentioned, the "upload and run python file" command works with a standard cluster.
This is my Dockerfile:
# This Dockerfile creates a clean Databricks runtime 12.2 LTS without any library ready to deploy to Databricks
FROM databricksruntime/minimal:14.3-LTS
# These are the versions compatible for DBR 12.x
ARG python_version="3.9"
ARG pip_version="22.3.1"
ARG setuptools_version="65.6.3"
ARG wheel_version="0.38.4"
ARG virtualenv_version="20.16.7"
# Set the debconf frontend to Noninteractive
RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
# Installs python 3.x and virtualenv for Spark and Notebooks
RUN sudo apt-get update && sudo apt-get install dialog apt-utils curl build-essential fuse openssh-server software-properties-common --yes \
&& sudo add-apt-repository ppa:deadsnakes/ppa -y && sudo apt-get update \
&& sudo apt-get install python${python_version} python${python_version}-dev python${python_version}-distutils --yes \
&& curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py \
&& /usr/bin/python${python_version} get-pip.py "pip>=${pip_version}" "setuptools>=${setuptools_version}" "wheel>=${wheel_version}" \
&& rm get-pip.py \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
RUN /usr/local/bin/pip${python_version} install --no-cache-dir virtualenv==${virtualenv_version} \
&& sed -i -r 's/^(PERIODIC_UPDATE_ON_BY_DEFAULT) = True$/\1 = False/' /usr/local/lib/python${python_version}/dist-packages/virtualenv/seed/embed/base_embed.py \
&& /usr/local/bin/pip${python_version} download pip==${pip_version} --dest \
/usr/local/lib/python${python_version}/dist-packages/virtualenv_support/
# Initialize the default environment that Spark and notebooks will use
RUN virtualenv --python=python${python_version} --system-site-packages /databricks/python3 --no-download --no-setuptools
# These python libraries are used by Databricks notebooks and the Python REPL
# You do not need to install pyspark - it is injected when the cluster is launched
# Versions are intended to reflect latest DBR: https://docs.databricks.com/release-notes/runtime/11.1.html#system-environment
RUN /databricks/python3/bin/pip install \
"six>=1.16.0" \
"jedi>=0.18.1" \
# ensure minimum ipython version for Python autocomplete with jedi 0.17.x
"ipython>=8.10.0" \
"pyarrow>=8.0.0" \
"ipykernel>=6.17.1" \
"grpcio>=1.48.1" \
"grpcio-status>=1.48.1" \
"databricks-sdk>=0.1.6"
# Specifies where Spark will look for the python process
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3
# Specifies Tracking URI for MLflow Integration
ENV MLFLOW_TRACKING_URI='databricks'
# Make sure the USER env variable is set. The files exposed
# by dbfs-fuse will be owned by this user.
# Within the container, the USER is always root.
ENV USER root
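
For completeness, the cluster that runs this image only needs Databricks Container Services enabled and the image URL in its configuration. The sketch below uses the Python databricks-sdk with placeholder values (registry URL, node type, runtime version are not my exact settings) and assumes the SDK's Clusters API accepts the same docker_image field as the REST Clusters API:

# Hedged sketch: create a cluster that pulls the custom image via
# Databricks Container Services. All values below are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import DockerImage

w = WorkspaceClient()  # reads host/token from env vars or ~/.databrickscfg

cluster = w.clusters.create(
    cluster_name="docker-image-test",
    spark_version="14.3.x-scala2.12",        # placeholder runtime version
    node_type_id="<node-type-id>",           # placeholder node type
    num_workers=1,
    autotermination_minutes=30,
    docker_image=DockerImage(url="<registry>/<repo>:<tag>"),
).result()                                   # wait until the cluster is running

print(cluster.cluster_id)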

