06-28-2024 08:01 AM - edited 06-28-2024 08:19 AM
I am trying to create a compute cluster with the Python SDK from a cluster-create configuration JSON file, which is the format the databricks-cli accepts and the Databricks GUI provides. Loading the JSON as a dict and passing it to the SDK fails because the SDK assumes that nested arguments are instances of specific dataclass types, e.g.:
> if autoscale is not None: body['autoscale'] = autoscale.as_dict()
E AttributeError: 'dict' object has no attribute 'as_dict'
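To illustrate where this error comes from, here is a minimal sketch that reproduces it without the SDK. `AutoScale` and `build_request_body` are hypothetical stand-ins written for this example, not the real SDK classes; they only mimic the `as_dict()` call shown in the traceback.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutoScale:
    """Hypothetical stand-in for an SDK-style nested settings class."""
    min_workers: int
    max_workers: int

    def as_dict(self) -> dict:
        return {"min_workers": self.min_workers, "max_workers": self.max_workers}

    @classmethod
    def from_dict(cls, d: dict) -> "AutoScale":
        return cls(min_workers=d["min_workers"], max_workers=d["max_workers"])


def build_request_body(spark_version: str, autoscale: Optional[AutoScale] = None) -> dict:
    """Mimics the serialization step quoted in the traceback."""
    body = {"spark_version": spark_version}
    if autoscale is not None:
        # Works for an AutoScale instance, fails for a plain dict
        body["autoscale"] = autoscale.as_dict()
    return body


# What json.load() would return for a config with a nested block
# ("example-version" is a placeholder, not a real runtime version)
raw_config = {
    "spark_version": "example-version",
    "autoscale": {"min_workers": 1, "max_workers": 4},
}

# Passing the raw dict straight through reproduces the failure
try:
    build_request_body(**raw_config)
    error_message = None
except AttributeError as err:
    error_message = str(err)

# Converting the nested dict to the dataclass first avoids it
typed_config = dict(raw_config, autoscale=AutoScale.from_dict(raw_config["autoscale"]))
body = build_request_body(**typed_config)
```

The top-level keys can be passed as keyword arguments unchanged; only the nested blocks need converting.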
This is the pattern of the call I'm making:
import json

from databricks.sdk import WorkspaceClient

db_client = WorkspaceClient()
with open("my/path/to/cluster-create.json") as file:
    create_config = json.load(file)
# Fails when create_config contains nested dicts such as "autoscale"
db_client.clusters.create_and_wait(**create_config)
I've attempted to look around in the SDK to see if there's a bootstrapping function but haven't found anything. I can certainly work around this situation, but it's a bit cumbersome so hoping the community can help point me to the magic-method I'm looking for.
Appreciated!
07-01-2024 05:45 AM - edited 07-01-2024 05:50 AM
@Retired_mod The structure of `cluster-create.json` is perfectly fine. As stated above, the issue is that the SDK does not accept nested structures from the JSON file as plain dicts; they need to be cast to specific Python dataclasses.
Here's what I came up with to get around the situation:
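from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import ClusterDetails, CreateCluster

def create_compute_cluster(db_client: WorkspaceClient, cluster_conf: dict) -> ClusterDetails:
    # from_dict converts nested dicts (e.g. autoscale) into the SDK dataclasses
    cc = CreateCluster.from_dict(cluster_conf)
    # Copy only the top-level fields so nested dataclass instances stay intact
    refactored_input = {field: getattr(cc, field) for field in cc.__dataclass_fields__}
    # CLUSTER_UP_TIMEOUT is assumed to be a datetime.timedelta defined elsewhere
    return db_client.clusters.create_and_wait(**refactored_input, timeout=CLUSTER_UP_TIMEOUT)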
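The workaround above copies only the top-level fields of the dataclass rather than calling `dataclasses.asdict`. A short sketch of why that matters, using hypothetical stand-in classes (`AutoScale` and `ClusterSpec` are not the SDK's types): `asdict` recurses and turns nested dataclasses back into plain dicts, which would bring back the original `as_dict` error.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class AutoScale:
    """Hypothetical nested settings class."""
    min_workers: int
    max_workers: int


@dataclass
class ClusterSpec:
    """Hypothetical top-level request class."""
    spark_version: str
    autoscale: Optional[AutoScale] = None


def shallow_kwargs(obj) -> dict:
    """Return top-level fields only; nested dataclass instances are left intact."""
    return {f.name: getattr(obj, f.name) for f in fields(obj)}


# "example-version" is a placeholder, not a real runtime version
spec = ClusterSpec("example-version", AutoScale(min_workers=1, max_workers=4))

shallow = shallow_kwargs(spec)  # autoscale is still an AutoScale instance
deep = asdict(spec)             # autoscale has been converted back to a dict
```

Using `dataclasses.fields` here is equivalent to iterating over `__dataclass_fields__` when the class has no class-level or init-only pseudo-fields, and it avoids touching a dunder attribute directly.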
I could also see the function reading the json file more like this:
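import json

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import ClusterDetails, CreateCluster

def create_compute_cluster(db_client: WorkspaceClient, create_config_path: str) -> ClusterDetails:
    # Load the same JSON file that the databricks-cli would consume
    with open(create_config_path) as file:
        create_config = json.load(file)
    # from_dict converts nested dicts (e.g. autoscale) into the SDK dataclasses
    cc = CreateCluster.from_dict(create_config)
    # Copy only the top-level fields so nested dataclass instances stay intact
    refactored_input = {field: getattr(cc, field) for field in cc.__dataclass_fields__}
    # CLUSTER_UP_TIMEOUT is assumed to be a datetime.timedelta defined elsewhere
    return db_client.clusters.create_and_wait(**refactored_input, timeout=CLUSTER_UP_TIMEOUT)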
It might make sense to add some functions to the ClustersAPI class for this, unless overloading with multipledispatch is preferred. All of this assumes there's a need for this pattern beyond my own. 🤷
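As a rough sketch of the overloading idea, the dispatch on input type could look like the following. Note the assumptions: it uses the standard library's `functools.singledispatch` rather than multipledispatch, and `ClusterSpec` and `normalize_spec` are hypothetical names invented for this example, not part of the SDK.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class ClusterSpec:
    """Hypothetical stand-in for a request dataclass."""
    spark_version: str
    num_workers: int = 1

    @classmethod
    def from_dict(cls, d: dict) -> "ClusterSpec":
        return cls(**d)


@singledispatch
def normalize_spec(config) -> ClusterSpec:
    """Fallback for input types that have no registered handler."""
    raise TypeError(f"unsupported config type: {type(config).__name__}")


@normalize_spec.register
def _(config: ClusterSpec) -> ClusterSpec:
    # Already typed: pass through unchanged
    return config


@normalize_spec.register
def _(config: dict) -> ClusterSpec:
    # Raw JSON-style input: convert to the dataclass first
    return ClusterSpec.from_dict(config)


# "example-version" is a placeholder, not a real runtime version
from_raw = normalize_spec({"spark_version": "example-version", "num_workers": 2})
typed = ClusterSpec("example-version", 2)
from_typed = normalize_spec(typed)
```

A create function built on top of this could accept either a dict loaded from JSON or an already-constructed dataclass and normalize it before making the API call.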
06-28-2024 08:19 AM
I have also opened an issue in the SDK GitHub repo: [ISSUE] clusters.create_and_wait not accepting dict-input from configuration file. · Issue #690 · da...
Saturday
Hello @tseader
I am also facing the same error when calling the clusters.create_and_wait function.
In your solution, in the function create_compute_cluster, can you please let me know what CreateCluster is and what it contains? When I try to run the code I get the error: NameError: name 'CreateCluster' is not defined.
I am referring to this line of code:
cc = CreateCluster.from_dict(create_config)
Thanks