01-22-2024 06:59 AM
I'm trying to access a Databricks SQL Warehouse from Python. I'm able to connect with a token from a compute instance on Azure Machine Learning. It's a VM with conda installed, on which I create a Python 3.10 environment.
from databricks import sql as dbsql

dbsql.connect(
    server_hostname="databricks_address",
    http_path="http_path",
    access_token="dapi....",
)
But once I create a job and launch it on a compute cluster with a custom Dockerfile:
FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04:latest

ENV https_proxy http://xxxxxx:yyyy
ENV no_proxy xxxxxx

RUN mkdir -p /usr/share/man/man1

RUN wget https://download.java.net/java/GA/jdk19.0.1/afdd2e245b014143b62ccb916125e3ce/10/GPL/openjdk-19.0.1_linux-x64_bin.tar.gz \
    && tar xvf openjdk-19.0.1_linux-x64_bin.tar.gz \
    && mv jdk-19.0.1 /opt/

ENV JAVA_HOME /opt/jdk-19.0.1
ENV PATH="${PATH}:$JAVA_HOME/bin"

# Install requirements with pip conf for Jfrog
COPY pip.conf pip.conf
ENV PIP_CONFIG_FILE pip.conf

# python installs (python 3.10 inside all azure ubuntu images)
COPY requirements.txt .
RUN pip install -r requirements.txt && rm requirements.txt

# set command
CMD ["bash"]
the image is built and my code starts to run, but it fails on the previous code sample. I am using the same values of https_proxy and no_proxy on the compute instance and on the compute cluster.
2024-01-22 13:30:13,520 - thrift_backend - Error during request to server: {"method": "OpenSession", "session-id": null, "query-id": null, "http-code": null, "error-message": "", "original-exception": "Retry request would exceed Retry policy max retry duration of 900.0 seconds", "no-retry-reason": "non-retryable error", "bounded-retry-delay": null, "attempt": "1/30", "elapsed-seconds": "846.7684090137482/900.0"}
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/67f1e8c93a8942d582fb7babc030101b/exe/wd/main.py", line 198, in <module>
    main()
  File "/mnt/azureml/cr/j/67f1e8c93a8942d582fb7babc030101b/exe/wd/main.py", line 31, in main
    return dbsql.connect(
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/__init__.py", line 51, in connect
    return Connection(server_hostname, http_path, access_token, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/client.py", line 235, in __init__
    self._open_session_resp = self.thrift_backend.open_session(
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/thrift_backend.py", line 576, in open_session
    response = self.make_request(self._client.OpenSession, open_session_req)
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/thrift_backend.py", line 505, in make_request
    self._handle_request_error(error_info, attempt, elapsed)
  File "/opt/miniconda/lib/python3.10/site-packages/databricks/sql/thrift_backend.py", line 335, in _handle_request_error
    raise network_request_error
databricks.sql.exc.RequestError: Error during request to server
In both environments I am using the latest version of databricks-sql-connector (3.0.1).
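For reference, here is a quick check I can drop at the top of main.py to confirm which proxy settings the job process actually sees (a minimal sketch; nothing here is specific to Databricks):

import os
import urllib.request

# Print the proxy-related environment variables as the process sees them.
for var in ("https_proxy", "HTTPS_PROXY", "no_proxy", "NO_PROXY"):
    print(var, "=", os.environ.get(var))

# urllib resolves the effective proxy map the same way most HTTP clients do.
print("effective proxies:", urllib.request.getproxies())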
01-22-2024 07:32 PM
Hi, could you please try the suggestion in https://github.com/databricks/databricks-sql-python/issues/23 (adding a new token) and let us know if this helps?
01-22-2024 11:57 PM
Hello,
I am already creating a new token each time I initialize my Spark session. I do this by using Azure's OAuth2 service to get a token valid for 1 hour and then calling the Databricks API 2.0 to generate a new PAT.
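Roughly, the flow looks like this (a minimal sketch with placeholder tenant/client/workspace values; the scope is the AzureDatabricks application ID and /api/2.0/token/create is the Databricks Token API 2.0 endpoint):

import requests

TENANT_ID = "<tenant-id>"            # placeholder
CLIENT_ID = "<sp-client-id>"         # placeholder
CLIENT_SECRET = "<sp-secret>"        # placeholder
WORKSPACE_URL = "https://adb-xxxx.azuredatabricks.net"  # placeholder

# 1) Get an Azure AD token for the AzureDatabricks resource (valid ~1 hour).
aad_token = requests.post(
    f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default",
    },
).json()["access_token"]

# 2) Exchange it for a short-lived Databricks PAT via the Token API 2.0.
pat = requests.post(
    f"{WORKSPACE_URL}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {aad_token}"},
    json={"lifetime_seconds": 3600, "comment": "aml-job"},
).json()["token_value"]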
And this token-generation code works locally and on compute instances in Azure, but not on compute clusters.
What I also tried: generating a token in the UI (which works locally), then using it in my code on the compute cluster; it still fails with the error above.
Could it be a network issue? I'm creating both the compute instance and the compute cluster with Terraform:
resource "azurerm_machine_learning_compute_cluster" "cluster" {
for_each = local.compute_cluster_configurations
name = each.key
location = var.context.location
vm_priority = each.value.vm_priority
vm_size = each.value.vm_size
machine_learning_workspace_id = module.mlw_01.id
subnet_resource_id = module.subnet_aml.id
# AML-05
ssh_public_access_enabled = false
node_public_ip_enabled = false
identity {
type = "UserAssigned"
identity_ids = [
azurerm_user_assigned_identity.compute_cluster_managed_identity.id
]
}
scale_settings {
min_node_count = each.value.min_node_count
max_node_count = each.value.max_node_count
scale_down_nodes_after_idle_duration = each.value.scale_down_nodes_after_idle_duration
}
}
# For each user, create a compute instance
resource "azurerm_machine_learning_compute_instance" "this" {
  for_each = local.all_users

  name                          = "${split("@", trimspace(local.all_users[each.key]["user_principal_name"]))[0]}-DS2-V2"
  location                      = var.context.location
  machine_learning_workspace_id = module.mlw_01.id
  virtual_machine_size          = "STANDARD_DS2_V2"

  identity {
    type = "UserAssigned"
    identity_ids = [
      azurerm_user_assigned_identity.this[each.key].id
    ]
  }

  assign_to_user {
    object_id = each.key
    tenant_id = var.tenant_id
  }

  node_public_ip_enabled = false
  subnet_resource_id     = module.subnet_aml.id

  description = "Compute instance generated by Terraform for : ${local.all_users[each.key]["display_name"]} | ${local.all_users[each.key]["user_principal_name"]} | ${each.key} "
}
I'm using the same subnet for both, so they should behave the same on the network side.
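As an extra check, I can test plain HTTPS reachability of the workspace from inside a cluster job (a minimal sketch; the workspace URL is a placeholder, and requests honours the https_proxy/no_proxy variables, so it exercises the same proxy path):

import requests

WORKSPACE_URL = "https://adb-xxxx.azuredatabricks.net"  # placeholder workspace URL

# Any HTTP status back means the workspace is reachable through the proxy;
# an exception points to a routing/proxy/NSG problem rather than auth.
try:
    resp = requests.get(WORKSPACE_URL, timeout=15)
    print("reached workspace, HTTP status:", resp.status_code)
except requests.RequestException as exc:
    print("cannot reach workspace:", exc)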
01-29-2024 12:39 AM
The issue was that the new version of databricks-sql-connector (3.0.1) does not surface error messages well. It gave me a generic error and a timeout where it should have given me the 403 and an immediate error message, without the 900-second retry window.
https://github.com/databricks/databricks-sql-python/issues/333
I've commented on the GitHub issue with more debugging details.
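For anyone debugging the same thing: turning up logging makes the connector print each retried request, which is how the underlying 403 became visible (a minimal sketch using the standard logging module; the databricks.sql logger name is assumed from the package layout seen in the traceback above):

import logging

# DEBUG on the root logger also surfaces urllib3's retry messages,
# including the HTTP status that triggered each retry.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("databricks.sql").setLevel(logging.DEBUG)  # assumed logger name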
But I'm still wondering why I got a 403 from my compute cluster and not from my compute instance, since both have the same roles. In the end I had to add a permission in Databricks on the group that holds both service principals so it can use the SQL warehouse, which is odd.