Issue with VSCode Extension and Databricks Cluster Using Docker Image
07-20-2024 06:36 AM - edited 07-20-2024 06:51 AM
I've encountered a significant issue while using the VSCode extension for Databricks, particularly when working with a cluster configured with a Docker image. Here's a detailed description of the problem:
Problem Description
When attempting to upload and run a Python file from VSCode on a Databricks cluster that uses a custom Docker image, the connection fails and the extension does not work as expected.
==============================
Errors in 00-databricks-init-3331c3ed293013bfec5837e683d00cfe.py:
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1721481367.546267  105941 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache
Error: CommandExecution.createAndWait: failed to reach Running state, got Error: [object Object]
7/20/2024, 8:34:11 AM - Creating execution context on cluster 0719 ...
Error: CommandExecution.createAndWait: failed to reach Running state, got Error: [object Object]
Execution terminated
Potential Workarounds
Databricks Connect: running Databricks Connect from a terminal does execute the Spark code on the cluster, but the VSCode extension does not.
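As an illustration of that workaround, a minimal Databricks Connect snippet along these lines runs fine from a terminal; the workspace URL, token, and cluster ID are placeholders, not my real values:

# Minimal sketch of the Databricks Connect workaround.
# Requires the databricks-connect package matching the cluster's DBR version.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host="https://<workspace-url>",      # placeholder workspace URL
    token="<personal-access-token>",     # placeholder PAT
    cluster_id="<cluster-id>",           # placeholder cluster ID
).getOrCreate()

# Trivial job to confirm the session reaches the Docker-image cluster.
print(spark.range(10).count())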
Note
It is worth mentioning that when I run the same Python file on a standard cluster without a Docker image, the VSCode extension works as expected.
07-22-2024 01:41 PM
Hi @Retired_mod, thanks for such a quick response.
Actually, I am using the Dockerfile from the Databricks runtime example here: https://github.com/databricks/containers/blob/master/ubuntu/minimal/Dockerfile. The configuration of the VSCode extension is fine, since, as I already mentioned, the "upload and run python file" command works with a standard cluster.
This is my Dockerfile:
# This Dockerfile creates a clean Databricks runtime 12.2 LTS without any library ready to deploy to Databricks
FROM databricksruntime/minimal:14.3-LTS
# These are the versions compatible for DBR 12.x
ARG python_version="3.9"
ARG pip_version="22.3.1"
ARG setuptools_version="65.6.3"
ARG wheel_version="0.38.4"
ARG virtualenv_version="20.16.7"
# Set the debconf frontend to Noninteractive
RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
# Installs python 3.x and virtualenv for Spark and Notebooks
RUN sudo apt-get update && sudo apt-get install dialog apt-utils curl build-essential fuse openssh-server software-properties-common --yes \
&& sudo add-apt-repository ppa:deadsnakes/ppa -y && sudo apt-get update \
&& sudo apt-get install python${python_version} python${python_version}-dev python${python_version}-distutils --yes \
&& curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py \
&& /usr/bin/python${python_version} get-pip.py "pip>=${pip_version}" "setuptools>=${setuptools_version}" "wheel>=${wheel_version}" \
&& rm get-pip.py \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
RUN /usr/local/bin/pip${python_version} install --no-cache-dir virtualenv==${virtualenv_version} \
&& sed -i -r 's/^(PERIODIC_UPDATE_ON_BY_DEFAULT) = True$/\1 = False/' /usr/local/lib/python${python_version}/dist-packages/virtualenv/seed/embed/base_embed.py \
&& /usr/local/bin/pip${python_version} download pip==${pip_version} --dest \
/usr/local/lib/python${python_version}/dist-packages/virtualenv_support/
# Initialize the default environment that Spark and notebooks will use
RUN virtualenv --python=python${python_version} --system-site-packages /databricks/python3 --no-download --no-setuptools
# These python libraries are used by Databricks notebooks and the Python REPL
# You do not need to install pyspark - it is injected when the cluster is launched
# Versions are intended to reflect latest DBR: https://docs.databricks.com/release-notes/runtime/11.1.html#system-environment
RUN /databricks/python3/bin/pip install \
"six>=1.16.0" \
"jedi>=0.18.1" \
# ensure minimum ipython version for Python autocomplete with jedi 0.17.x
"ipython>=8.10.0" \
"pyarrow>=8.0.0" \
"ipykernel>=6.17.1" \
"grpcio>=1.48.1" \
"grpcio-status>=1.48.1" \
"databricks-sdk>=0.1.6"
# Specifies where Spark will look for the python process
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3
# Specifies Tracking URI for MLflow Integration
ENV MLFLOW_TRACKING_URI='databricks'
# Make sure the USER env variable is set. The files exposed
# by dbfs-fuse will be owned by this user.
# Within the container, the USER is always root.
ENV USER root
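
For completeness, the cluster that runs this image only needs Databricks Container Services enabled and the image URL in its configuration. The sketch below uses the Python databricks-sdk with placeholder values (registry URL, node type, runtime version are not my exact settings) and assumes the SDK's Clusters API accepts the same docker_image field as the REST Clusters API:

# Hedged sketch: create a cluster that pulls the custom image via
# Databricks Container Services. All values below are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import DockerImage

w = WorkspaceClient()  # reads host/token from env vars or ~/.databrickscfg

cluster = w.clusters.create(
    cluster_name="docker-image-test",
    spark_version="14.3.x-scala2.12",        # placeholder runtime version
    node_type_id="<node-type-id>",           # placeholder node type
    num_workers=1,
    autotermination_minutes=30,
    docker_image=DockerImage(url="<registry>/<repo>:<tag>"),
).result()                                   # wait until the cluster is running

print(cluster.cluster_id)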

