01-11-2023 06:15 AM
I'm using DefaultAzureCredential from azure-identity to connect to Azure with service principal environment variables (AZURE_CLIENT_SECRET, AZURE_TENANT_ID, AZURE_CLIENT_ID).
I can get_token from a specific scope for databricks like this:
from azure.identity import DefaultAzureCredential
dbx_scope = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default"
token = DefaultAzureCredential().get_token(dbx_scope).token
So this is working great: I get the token, and then I can use `databricks-connect` to configure my connection to the cluster. This generates a configuration file ($HOME/.databricks-connect) that tells Spark where to connect and which token to use.
{
  "host": "https://adb-1234.azuredatabricks.net",
  "token": "eyJ0eXAiXXXXXXXXXXXXXXXXXXXXXx",
  "cluster_id": "1234",
  "org_id": "1234",
  "port": "15001"
}
The issue is that this token does not last very long. When I use Spark for more than an hour, I get disconnected because the token has expired.
Is there a way to get a longer-lived token for Databricks with a service principal? Since this is aimed at production, I would like my code to generate a PAT for each run; I don't want to create a PAT manually and store it in an Azure Key Vault.
01-11-2023 07:44 AM
There is a REST API endpoint to manage tokens:
https://docs.databricks.com/dev-tools/api/latest/token-management.html
Using your code, you already have the host and a short-lived token. All you need to do is call that REST API with them to generate a longer-lived token.
Create a token on behalf of a service principal. >> https://docs.databricks.com/dev-tools/api/latest/token-management.html#operation/create-obo-token
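For illustration, a rough sketch of what that on-behalf-of call could look like, assuming the requests library and that the caller has permission to use the Token Management API; the application_id, lifetime and host below are placeholders, not values from this thread:

import requests
from azure.identity import DefaultAzureCredential

dbx_scope = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default"
host = "https://adb-1234.azuredatabricks.net"  # placeholder workspace URL

# Short-lived AAD token for the service principal, as in the question
aad_token = DefaultAzureCredential().get_token(dbx_scope).token

# Ask the Token Management API for a PAT on behalf of the service principal.
# application_id is the service principal's client (application) ID; lifetime here is 24 h.
resp = requests.post(
    f"{host}/api/2.0/token-management/on-behalf-of/tokens",
    headers={"Authorization": f"Bearer {aad_token}"},
    json={
        "application_id": "00000000-0000-0000-0000-000000000000",  # placeholder
        "lifetime_seconds": 86400,
        "comment": "generated for this run",
    },
)
resp.raise_for_status()
pat = resp.json()["token_value"]

The resulting pat could then go into the $HOME/.databricks-connect file in place of the short-lived AAD token.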
01-11-2023 07:53 AM
The issue with this (I think) is that it will create a new token for each run of my code in Azure ML. So if I get over 600 runs, I generate 600 PATs, and that's the Databricks limit on PATs. The next runs won't be able to create new tokens and would be stuck.
Is there a way to remove "old" PATs, for example PATs that are older than 24 hours?
I was also thinking of a solution that keeps the short-lived token: every X minutes I would ask for a new one, but then I have to re-initialize my SparkSession and lose all the work. Isn't there a way to inject the new token into the Spark config?
Something like this:
spark_session.conf.set("spark.some.option.token", new_token)
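(For what it's worth, the legacy Databricks Connect configuration does expose a spark.databricks.service.token key, so the idea could look like the sketch below; whether a running session actually picks up the refreshed token without being rebuilt is exactly what I am unsure about:)

from azure.identity import DefaultAzureCredential

dbx_scope = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default"

# Refresh the AAD token and push it into the live Spark config.
# spark_session is the SparkSession already created through databricks-connect.
new_token = DefaultAzureCredential().get_token(dbx_scope).token
spark_session.conf.set("spark.databricks.service.token", new_token)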
01-11-2023 07:55 AM
There are API calls to delete or manage tokens, so you can implement your own cleanup logic.
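As a rough sketch of such cleanup logic (assuming the Token API's /api/2.0/token/list and /api/2.0/token/delete endpoints, and that creation_time is reported in epoch milliseconds; the host is a placeholder):

import time
import requests
from azure.identity import DefaultAzureCredential

dbx_scope = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default"
host = "https://adb-1234.azuredatabricks.net"  # placeholder workspace URL
headers = {"Authorization": f"Bearer {DefaultAzureCredential().get_token(dbx_scope).token}"}

# Anything created more than 24 hours ago is considered stale
cutoff_ms = (time.time() - 24 * 3600) * 1000

tokens = requests.get(f"{host}/api/2.0/token/list", headers=headers).json()
for info in tokens.get("token_infos", []):
    if info["creation_time"] < cutoff_ms:
        requests.post(
            f"{host}/api/2.0/token/delete",
            headers=headers,
            json={"token_id": info["token_id"]},
        )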
01-16-2023 08:15 AM
I came up with an alternative solution: I wrote my own Python class to handle my PAT from Databricks: https://stackoverflow.com/questions/75071869/python-defaultazurecredential-get-token-set-expiration-...
You can be fancier, or even register an atexit handler inside the class to destroy the PAT. But this has a side effect: the Python process exits with no error code, but if you have a logger, it will warn you that the connection with Databricks was closed because of an invalid token. Which is "normal", but ugly.
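In rough terms, the class looks something like the sketch below (a minimal version under my own assumptions, not the exact code from the StackOverflow post; it uses the Token API's /api/2.0/token/create and /api/2.0/token/delete endpoints and a one-hour lifetime for illustration):

import atexit
import requests
from azure.identity import DefaultAzureCredential


class DatabricksPAT:
    """Create a Databricks PAT from a service principal's AAD token and
    revoke it again when the Python process exits."""

    DBX_SCOPE = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default"

    def __init__(self, host, lifetime_seconds=3600, comment="azure-ml-run"):
        self.host = host.rstrip("/")
        aad_token = DefaultAzureCredential().get_token(self.DBX_SCOPE).token
        self._headers = {"Authorization": f"Bearer {aad_token}"}

        # Mint a PAT for the calling principal with a bounded lifetime
        resp = requests.post(
            f"{self.host}/api/2.0/token/create",
            headers=self._headers,
            json={"lifetime_seconds": lifetime_seconds, "comment": comment},
        )
        resp.raise_for_status()
        payload = resp.json()
        self.token = payload["token_value"]
        self._token_id = payload["token_info"]["token_id"]

        # Destroy the PAT when the interpreter shuts down
        atexit.register(self._revoke)

    def _revoke(self):
        requests.post(
            f"{self.host}/api/2.0/token/delete",
            headers=self._headers,
            json={"token_id": self._token_id},
        )

Usage is then pat = DatabricksPAT("https://adb-1234.azuredatabricks.net"), with pat.token going into the databricks-connect configuration; the atexit revocation is what produces the "invalid token" warning mentioned above.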