topic Re: databricks.sql.exc.RequestError OpenSession error None in Data Engineering

databricks.sql.exc.RequestError OpenSession error None

Etyr — Mon, 22 Jan 2024 14:59:05 GMT

I'm trying to access a Databricks SQL Warehouse with Python. I'm able to connect with a token from a Compute Instance on Azure Machine Learning. It's a VM with conda installed, where I created a Python 3.10 environment.

from databricks import sql as dbsql

dbsql.connect(
    server_hostname="databricks_address",
    http_path="http_path",
    access_token="dapi....",
)

But when I create a job and launch it on a Compute Cluster with this custom Dockerfile:

FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04:latest


ENV https_proxy http://xxxxxx:yyyy
ENV no_proxy xxxxxx

RUN mkdir -p /usr/share/man/man1

RUN wget https://download.java.net/java/GA/jdk19.0.1/afdd2e245b014143b62ccb916125e3ce/10/GPL/openjdk-19.0.1_linux-x64_bin.tar.gz \
    && tar xvf openjdk-19.0.1_linux-x64_bin.tar.gz \
    && mv jdk-19.0.1 /opt/

ENV JAVA_HOME /opt/jdk-19.0.1
ENV PATH="${PATH}:$JAVA_HOME/bin"

# Install requirements with pip conf for Jfrog
COPY pip.conf pip.conf
ENV PIP_CONFIG_FILE pip.conf


# python installs (python 3.10 inside all azure ubuntu images)
COPY requirements.txt .
RUN pip install -r requirements.txt && rm requirements.txt

# set command
CMD ["bash"]

My image is built and starts running my code, but it fails on the code sample above. I am using the same https_proxy and no_proxy values on my compute instance and my compute cluster.

2024-01-22 13:30:13,520 - thrift_backend - Error during request to server: {"method": "OpenSession", "session-id": null, "query-id": null, "http-code": null, "error-message": "", "original-exception": "Retry request would exceed Retry policy max retry duration of 900.0 seconds", "no-retry-reason": "non-retryable error", "bounded-retry-delay": null, "attempt": "1/30", "elapsed-seconds": "846.7684090137482/900.0"}
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/67f1e8c93a8942d582fb7babc030101b/exe/wd/main.py", line 198, in <module>
    main()
  File "/mnt/azureml/cr/j/67f1e8c93a8942d582fb7babc030101b/exe/wd/main.py", line 31, in main
    return dbsql.connect(
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/__init__.py", line 51, in connect
    return Connection(server_hostname, http_path, access_token, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/client.py", line 235, in __init__
    self._open_session_resp = self.thrift_backend.open_session(
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/thrift_backend.py", line 576, in open_session
    response = self.make_request(self._client.OpenSession, open_session_req)
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/thrift_backend.py", line 505, in make_request
    self._handle_request_error(error_info, attempt, elapsed)
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/thrift_backend.py", line 335, in _handle_request_error
    raise network_request_error
databricks.sql.exc.RequestError: Error during request to server

In both environments, I am using the latest version of databricks-sql-connector (3.0.1).
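The fields in that log line are easier to read once the embedded JSON is parsed. The sketch below is only a diagnostic aid built on the log line quoted above; the helper name `parse_error_context` is hypothetical and not part of the connector.

```python
import json

# The thrift_backend log line from the failing job, copied from above.
LOG_LINE = (
    '2024-01-22 13:30:13,520 - thrift_backend - Error during request to server: '
    '{"method": "OpenSession", "session-id": null, "query-id": null, '
    '"http-code": null, "error-message": "", '
    '"original-exception": "Retry request would exceed Retry policy max retry '
    'duration of 900.0 seconds", '
    '"no-retry-reason": "non-retryable error", "bounded-retry-delay": null, '
    '"attempt": "1/30", "elapsed-seconds": "846.7684090137482/900.0"}'
)


def parse_error_context(line: str) -> dict:
    """Return the JSON object embedded at the end of a log line (hypothetical helper)."""
    start = line.index("{")  # the JSON payload starts at the first brace
    return json.loads(line[start:])


ctx = parse_error_context(LOG_LINE)
# "elapsed-seconds" is formatted as "<elapsed>/<budget>".
elapsed, budget = (float(part) for part in ctx["elapsed-seconds"].split("/"))

print("method:", ctx["method"])
print("http-code:", ctx["http-code"])  # None: no HTTP status was reported
print(f"elapsed {elapsed:.0f}s of a {budget:.0f}s retry budget")
```

The notable parts are that `http-code` is null and that almost the whole retry budget elapsed on the first attempt, so the log alone does not say which HTTP status the server returned.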

Re: databricks.sql.exc.RequestError OpenSession error None

Debayan — Tue, 23 Jan 2024 03:32:24 GMT

Hi, could you please try the suggestion in https://github.com/databricks/databricks-sql-python/issues/23 (adding a new token) and let us know if it helps?

Re: databricks.sql.exc.RequestError OpenSession error None

Etyr — Tue, 23 Jan 2024 07:57:57 GMT

Hello,

I am already creating a new token each time I initialize my Spark session. I do this by using Azure's OAuth2 service to get a token lasting 1 hour, then using the Databricks API 2.0 to generate a new PAT.
This code works locally and on compute instances in Azure, but not on Compute Clusters.
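For readers unfamiliar with that two-step flow, here is a rough sketch of what it can look like. It assumes a service principal using the client-credentials grant, which the post does not state; the function names, the `comment` value, and the placeholder arguments are all hypothetical. The endpoints used are the Microsoft identity platform token endpoint and the Databricks `POST /api/2.0/token/create` endpoint.

```python
import json
import urllib.parse
import urllib.request

# Well-known Azure AD application ID of the Azure Databricks resource.
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"


def build_aad_token_request(tenant_id, client_id, client_secret):
    """Build the OAuth2 client-credentials request (assumed grant type)."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{DATABRICKS_RESOURCE_ID}/.default",
    }).encode()
    return urllib.request.Request(url, data=body, method="POST")


def build_pat_request(workspace_host, aad_token, lifetime_seconds=3600):
    """Build the Databricks Token API request that creates a new PAT."""
    url = f"https://{workspace_host}/api/2.0/token/create"
    body = json.dumps({
        "lifetime_seconds": lifetime_seconds,
        "comment": "aml job",  # hypothetical label
    }).encode()
    headers = {
        "Authorization": f"Bearer {aad_token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, method="POST", headers=headers)


def send_json(request, timeout=30):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (not executed here, the values are placeholders):
# aad = send_json(build_aad_token_request(tenant, client, secret))["access_token"]
# pat = send_json(build_pat_request("databricks_address", aad))["token_value"]
```

Note that a PAT being issued successfully only shows that the identity can authenticate to the workspace; it says nothing about permissions on a particular SQL warehouse.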

What I also tried: generating a token in the UI, which works locally, then using it in my code on my compute cluster, where it fails with the error above.

Could it be a network issue? I'm creating both the compute instance and the compute cluster in Terraform:

resource "azurerm_machine_learning_compute_cluster" "cluster" {
  for_each = local.compute_cluster_configurations

  name                          = each.key
  location                      = var.context.location
  vm_priority                   = each.value.vm_priority
  vm_size                       = each.value.vm_size
  machine_learning_workspace_id = module.mlw_01.id
  subnet_resource_id            = module.subnet_aml.id

  # AML-05
  ssh_public_access_enabled = false
  node_public_ip_enabled    = false

  identity {
    type = "UserAssigned"
    identity_ids = [
      azurerm_user_assigned_identity.compute_cluster_managed_identity.id
    ]
  }

  scale_settings {
    min_node_count                       = each.value.min_node_count
    max_node_count                       = each.value.max_node_count
    scale_down_nodes_after_idle_duration = each.value.scale_down_nodes_after_idle_duration
  }
}

# For each user, create a compute instance
resource "azurerm_machine_learning_compute_instance" "this" {
  for_each = local.all_users

  name                          = "${split("@", trimspace(local.all_users[each.key]["user_principal_name"]))[0]}-DS2-V2"
  location                      = var.context.location
  machine_learning_workspace_id = module.mlw_01.id
  virtual_machine_size          = "STANDARD_DS2_V2"

  identity {
    type = "UserAssigned"
    identity_ids = [
      azurerm_user_assigned_identity.this[each.key].id
    ]
  }

  assign_to_user {
    object_id = each.key
    tenant_id = var.tenant_id
  }

  node_public_ip_enabled = false
  subnet_resource_id     = module.subnet_aml.id
  description            = "Compute instance generated by Terraform for : ${local.all_users[each.key]["display_name"]} | ${local.all_users[each.key]["user_principal_name"]} | ${each.key} "
}

I'm using the same subnet, so both should behave the same way on the network.

Re: databricks.sql.exc.RequestError OpenSession error None

Etyr — Mon, 29 Jan 2024 08:39:25 GMT

The issue was that the new version of databricks-sql-connector (3.0.1) does not handle error messages well. It gave a generic error after a timeout, when it should have given me a 403 and an immediate error message without the 900-second timeout.

https://github.com/databricks/databricks-sql-python/issues/333

I've commented on the GitHub issue for further debugging.
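When the connector hides the HTTP status like this, one way to see it sooner is a plain authenticated HTTP request from the same container before calling `dbsql.connect`. The sketch below is an assumption-laden diagnostic, not a fix: `probe_http_status` is a hypothetical helper, and the choice of the `GET /api/2.0/sql/warehouses` REST endpoint as the probe target is mine. It checks the proxy path, the token, and general workspace authorization, and may not reproduce a warehouse-level permission error.

```python
import urllib.error
import urllib.request


def probe_http_status(server_hostname, access_token,
                      opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status of an authenticated GET against the workspace.

    urllib picks up https_proxy / no_proxy from the environment, so the
    request takes the same proxy path as the job. `opener` is injectable
    so the function can be exercised without network access.
    """
    request = urllib.request.Request(
        f"https://{server_hostname}/api/2.0/sql/warehouses",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is what we want.
        return err.code


# Usage (not executed here, the values are placeholders):
# status = probe_http_status("databricks_address", "dapi....")
# print("HTTP status:", status)
```

A 401 or 403 here comes back within the short timeout and points at credentials or permissions, while a connection error (raised as `urllib.error.URLError`) points at the proxy or network instead.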

But I'm still wondering why I got a 403 error from my compute cluster and not from my compute instance, when they have the same roles. I had to add a role to the group that handles both Service Principals in Databricks so they could use the SQL warehouse, which is odd.