databricks.sql.exc.RequestError OpenSession error None

Etyr
Contributor

I'm trying to access a Databricks SQL Warehouse with Python. I'm able to connect with a token from a Compute Instance on Azure Machine Learning (a VM with conda installed, where I create a Python 3.10 env):

from databricks import sql as dbsql

dbsql.connect(
    server_hostname="databricks_address",
    http_path="http_path",
    access_token="dapi....",
)
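
For completeness, here is a minimal sketch of how that connection gets exercised once it opens (same placeholder hostname, path and token; the cursor API is the standard one from databricks-sql-connector):

from databricks import sql as dbsql

# Placeholders: replace with the real workspace hostname, warehouse HTTP path and PAT.
connection = dbsql.connect(
    server_hostname="databricks_address",
    http_path="http_path",
    access_token="dapi....",
)

cursor = connection.cursor()
# Trivial query just to confirm the session opens and the warehouse answers.
cursor.execute("SELECT 1")
print(cursor.fetchall())

cursor.close()
connection.close()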

But once I create a job and launch it on a Compute Cluster with a custom Dockerfile:

FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04:latest


ENV https_proxy http://xxxxxx:yyyy
ENV no_proxy xxxxxx

RUN mkdir -p /usr/share/man/man1

RUN wget https://download.java.net/java/GA/jdk19.0.1/afdd2e245b014143b62ccb916125e3ce/10/GPL/openjdk-19.0.1_linux-x64_bin.tar.gz \
    && tar xvf openjdk-19.0.1_linux-x64_bin.tar.gz \
    && mv jdk-19.0.1 /opt/

ENV JAVA_HOME /opt/jdk-19.0.1
ENV PATH="${PATH}:$JAVA_HOME/bin"

# Install requirements with pip conf for Jfrog
COPY pip.conf pip.conf
ENV PIP_CONFIG_FILE pip.conf


# python installs (python 3.10 inside all azure ubuntu images)
COPY requirements.txt .
RUN pip install -r requirements.txt && rm requirements.txt

# set command
CMD ["bash"]

My image is created and starts to run my code, but it fails on the previous code sample. I am using the same values of https_proxy and no_proxy on my compute instance and compute cluster.

2024-01-22 13:30:13,520 - thrift_backend - Error during request to server: {"method": "OpenSession", "session-id": null, "query-id": null, "http-code": null, "error-message": "", "original-exception": "Retry request would exceed Retry policy max retry duration of 900.0 seconds", "no-retry-reason": "non-retryable error", "bounded-retry-delay": null, "attempt": "1/30", "elapsed-seconds": "846.7684090137482/900.0"}
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/67f1e8c93a8942d582fb7babc030101b/exe/wd/main.py", line 198, in <module>
    main()
  File "/mnt/azureml/cr/j/67f1e8c93a8942d582fb7babc030101b/exe/wd/main.py", line 31, in main
    return dbsql.connect(
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/__init__.py", line 51, in connect
    return Connection(server_hostname, http_path, access_token, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/client.py", line 235, in __init__
    self._open_session_resp = self.thrift_backend.open_session(
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/thrift_backend.py", line 576, in open_session
    response = self.make_request(self._client.OpenSession, open_session_req)
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/thrift_backend.py", line 505, in make_request
    self._handle_request_error(error_info, attempt, elapsed)
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/thrift_backend.py", line 335, in _handle_request_error
    raise network_request_error
databricks.sql.exc.RequestError: Error during request to server

In both, I am using the latest version of databricks-sql-connector (3.0.1).
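
One thing that can help here is turning on the connector's debug logging before calling connect(), so the underlying HTTP status shows up instead of just the generic retry error. Only the standard library is used; the "databricks.sql" logger name is an assumption based on the module paths in the traceback above.

import logging

# Print the connector's own messages at DEBUG level; the "databricks.sql"
# logger name is an assumption based on the module paths in the traceback.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))

db_logger = logging.getLogger("databricks.sql")
db_logger.setLevel(logging.DEBUG)
db_logger.addHandler(handler)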

3 REPLIES

Debayan
Esteemed Contributor III

Hi, could you please try https://github.com/databricks/databricks-sql-python/issues/23 (adding a new token) and let us know if this helps?

Hello,

I am already recreating a new token each time I init my Spark session. I do this by using Azure's OAuth2 service to get a token that lasts 1 hour, and then using the Databricks API 2.0 to generate a new PAT (roughly the flow sketched below).
And this code works locally and on compute instances in Azure, but not on Compute Clusters.
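
A minimal sketch of that flow, assuming azure-identity for the AAD token and the documented /api/2.0/token/create endpoint; the workspace URL, token lifetime and managed-identity client ID are placeholders:

import requests
from azure.identity import DefaultAzureCredential

WORKSPACE_URL = "https://adb-xxxx.azuredatabricks.net"  # placeholder

# 1) AAD token for the AzureDatabricks first-party application
#    (2ff814a6-3304-4ab8-85cb-cd0e6f879c1d); valid for about 1 hour.
#    On a compute with a user-assigned identity, pass managed_identity_client_id=...
credential = DefaultAzureCredential()
aad_token = credential.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token

# 2) Exchange it for a Databricks PAT through the Token API.
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {aad_token}"},
    json={"lifetime_seconds": 3600, "comment": "job token"},
)
resp.raise_for_status()
pat = resp.json()["token_value"]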

What I also tried: generating a token in the UI (which works locally), then using it in my code on my compute cluster, which still fails with the above error.

Could it be a network issue? I'm creating both the compute instance and the compute cluster with Terraform:

 

resource "azurerm_machine_learning_compute_cluster" "cluster" {
  for_each = local.compute_cluster_configurations

  name     = each.key
  location = var.context.location

  vm_priority                   = each.value.vm_priority
  vm_size                       = each.value.vm_size
  machine_learning_workspace_id = module.mlw_01.id
  subnet_resource_id            = module.subnet_aml.id

  # AML-05
  ssh_public_access_enabled = false
  node_public_ip_enabled    = false

  identity {
    type = "UserAssigned"
    identity_ids = [
      azurerm_user_assigned_identity.compute_cluster_managed_identity.id
    ]
  }

  scale_settings {
    min_node_count                       = each.value.min_node_count
    max_node_count                       = each.value.max_node_count
    scale_down_nodes_after_idle_duration = each.value.scale_down_nodes_after_idle_duration
  }
}

 

# For each user, create a compute instance
resource "azurerm_machine_learning_compute_instance" "this" {
  for_each = local.all_users

  name                          = "${split("@", trimspace(local.all_users[each.key]["user_principal_name"]))[0]}-DS2-V2"
  location                      = var.context.location
  machine_learning_workspace_id = module.mlw_01.id
  virtual_machine_size          = "STANDARD_DS2_V2"
  identity {
    type = "UserAssigned"
    identity_ids = [
      azurerm_user_assigned_identity.this[each.key].id
    ]
  }
  assign_to_user {
    object_id = each.key
    tenant_id = var.tenant_id
  }
  node_public_ip_enabled = false
  subnet_resource_id     = module.subnet_aml.id
  description            = "Compute instance generated by Terraform for : ${local.all_users[each.key]["display_name"]} | ${local.all_users[each.key]["user_principal_name"]} | ${each.key} "
}

I'm using the same subnet, so networking should behave the same for both.
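
If it is a network issue, a quick way to compare the two environments is to hit the workspace hostname directly from each of them, outside the connector's retry loop. The hostname is a placeholder; requests should pick up the same https_proxy/no_proxy values set in the image:

import requests

# Placeholder: the same server_hostname passed to dbsql.connect().
url = "https://databricks_address"

try:
    # requests reads https_proxy / no_proxy from the environment,
    # so this should follow the same network path as the connector.
    r = requests.get(url, timeout=10)
    print("reachable, HTTP status:", r.status_code)
except requests.RequestException as exc:
    print("network problem:", exc)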

Etyr
Contributor

The issue was that the new version of databricks-sql-connector (3.0.1) does not handle error messages well. It gave a generic error and a timeout where it should have given me a 403 and an immediate error message, instead of a 900-second timeout.

https://github.com/databricks/databricks-sql-python/issues/333

I've commented on the GitHub issue with more debugging details.

But I'm still wondering why I got a 403 error from my compute cluster and not from my compute instance, when they have the same roles. I had to add a role on the group handling both Service Principals in Databricks so it can use the SQL warehouse, which is odd.
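
For anyone hitting the same thing: one way to surface the 403 immediately, rather than waiting out the 900-second retry window, is to call a SQL Warehouses REST endpoint with the same PAT before opening the Thrift session. A sketch, using the documented /api/2.0/sql/warehouses list endpoint; the workspace URL and token are placeholders:

import requests

WORKSPACE_URL = "https://adb-xxxx.azuredatabricks.net"  # placeholder
PAT = "dapi...."  # the same token the job will use

# An auth/permission problem shows up here as 401/403 within seconds,
# instead of the connector retrying OpenSession for 900 seconds.
resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/sql/warehouses",
    headers={"Authorization": f"Bearer {PAT}"},
    timeout=30,
)
print(resp.status_code, resp.text[:200])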

 

 
