Serverless ML

Daniele-T
New Contributor

Hello,

I'm trying to set up a DAB job that runs an ML job. For this, it would be useful to use the serverless ML environment that I can select in notebooks. However, I cannot find a meaningful way to define the base environment as ML.

I do not want to supply a requirements-ML.txt file, as I think it would increase the start-up time. I could not find any useful documentation on this.

I tried something like

  environments:
    - environment_key: default
      spec:
        environment_version: "5"
        base_environment: "ML"
        dependencies:
          - ... light dependencies

but it expects a YAML file for the base environment.

Does anybody have a tip?

Thank you,

Daniele

 
Accepted Solution

Ashwin_DSA
Databricks Employee

Hi @Daniele-T,

Thanks for sharing the exact error. It confirms this is not a quoting or syntax issue on your end.

As far as I can tell, this is a platform limitation. In the Jobs/DAB deployment path, base_environment is only accepted when it points to a custom environment spec file stored in Workspace files or a Unity Catalog Volume, for example /Workspace/Shared/envs/ml-env.yml or /Volumes/my-catalog/my-schema/envs/ml-env.yml. Managed identifiers such as databricks_ml_v5 are rejected at the API layer regardless of the CLI version. I reproduced the same error in my own sandbox, so this is consistent behaviour and not specific to your setup.

The serverless environment docs also note that if the base-environment preview is not enabled in a workspace, jobs expose environment_version rather than base_environment, and the Custom option expects a YAML file path.

So, for a scheduled Python script job deployed via Bundles and the CLI, you have two supported paths. The simpler one is to drop base_environment entirely and use environment_version with a small dependency list directly in your databricks.yml:

environments:
  - environment_key: default
    spec:
      environment_version: "5"
      dependencies:
        - pandas==2.2.2
        - scikit-learn==1.5.1
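
For context, here is a rough sketch of how that environment could sit inside a complete job resource in a bundle. The job name ml_training_job, the task key train, and the script path ../src/train.py are placeholders and do not come from this thread. The sketch assumes the task is a spark_python_task with no cluster configured, so it runs on serverless and picks up the environment through environment_key.

```yaml
resources:
  jobs:
    ml_training_job:              # hypothetical job name
      name: ml-training-job
      tasks:
        - task_key: train         # hypothetical task key
          spark_python_task:
            python_file: ../src/train.py   # hypothetical script path
          # Links this task to the environment defined below.
          environment_key: default
      environments:
        - environment_key: default
          spec:
            environment_version: "5"
            dependencies:
              - pandas==2.2.2
              - scikit-learn==1.5.1
```

The environment_key on the task has to match an environment_key in the job's environments list; otherwise the task has no environment to resolve.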

The second option, if you prefer to keep the environment spec separate from your bundle config, is to put the same spec in a YAML file, upload it to Workspace files or a UC Volume, and reference it via an absolute path:

environments:
  - environment_key: default
    spec:
      base_environment: /Workspace/Users/your-user@company.com/envs/ml-env.yml

One important caveat on the second path: it is not equivalent to getting the Databricks ML runtime. The YAML file resolves to the same environment_version + dependencies mechanism, so you still need to list your packages explicitly. There is no way today to reference databricks_ml_v5 through the YAML path either.

On the startup time concern, it is worth knowing that serverless environments are cached, so after the first cold start, subsequent runs sharing the same dependency fingerprint will not reinstall. For a light dependency set, the overhead is smaller than it might seem. 

If your intention is specifically to get the full Databricks ML runtime pre-loaded (MLflow, Delta, the full ML stack) without listing packages, that is not supported in the DAB/CLI deployment path as of now. You would need to add those packages explicitly to your dependencies list.
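
As an illustration of listing ML packages explicitly, a spec along these lines might be a starting point. The package selection is an assumption about what a typical ML job imports, not the contents of the managed ML environment, and versions are left unpinned here only for brevity.

```yaml
environments:
  - environment_key: default
    spec:
      environment_version: "5"
      dependencies:
        # Illustrative selection only; replace with what your job actually imports.
        - mlflow
        - scikit-learn
        - xgboost
```

Pinning exact versions (for example pandas==2.2.2, as in the earlier snippet) keeps the dependency set identical between runs, which is what the caching relies on.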

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***


4 REPLIES

Ashwin_DSA
Databricks Employee

Hi @Daniele-T,

ML is the UI label for a Databricks-managed serverless base environment, but in Jobs/DAB you generally need to use the job environment model rather than the notebook picker directly. Databricks documents that the serverless base environment options include Standard, ML, AI, previous versions, Custom (YAML), and workspace environments, and that job tasks are configured through the job environment settings.

If this is a notebook task, the simplest option may be to let the task use the notebook's own environment, because notebook tasks default to Notebook Environment unless you override them with a job-level environment. See Configure the serverless environment.
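
A minimal sketch of that notebook-task route in a bundle might look like the following. The job name, task key, and notebook path are placeholders and not from this thread. The sketch assumes the ML base environment has already been selected in the notebook itself, and it deliberately leaves out both environment_key and a job-level environments block so that nothing overrides the notebook's own environment.

```yaml
resources:
  jobs:
    ml_notebook_job:              # hypothetical job name
      name: ml-notebook-job
      tasks:
        - task_key: train         # hypothetical task key
          notebook_task:
            # Hypothetical path; the notebook carries its own environment settings.
            notebook_path: ../src/train_notebook.ipynb
```

This only helps when the workload can run as a notebook; a plain Python script task still needs a job-level environment.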

If you do want to define the environment in DAB, base_environment: "ML" is not the right value. For managed base environments, Databricks uses versioned identifiers, and Databricks-provided ML environments are versioned like databricks_ml_v5. The public environment APIs also describe Databricks-provided ML base environments as workspace-base-environments/databricks_ml_..., for example workspace-base-environments/databricks_ml_v5.

So the configuration should look more like this:

environments:
  - environment_key: default
    spec:
      base_environment: databricks_ml_v5
      dependencies:
        - ...

In that case, do not also set environment_version in the same spec.

A second thing to check is workspace support. Databricks notes that selecting a managed base environment for jobs is in beta, and that if the workspace does not have that feature enabled, the job configuration shows an Environment version drop-down instead of Base environment. In those workspaces, the "Custom" option expects a YAML file, which matches what you are seeing. See Configure the serverless environment.

In summary:

  • If this is a notebook task, consider configuring the notebook itself to use the ML base environment and let the job use Notebook Environment by default.
  • If you want a job-level environment and your workspace supports managed base environments for jobs, use base_environment: databricks_ml_v5 plus your light dependencies.
  • If your workspace does not support managed base environments for jobs yet, then the supported routes are either environment_version plus dependencies, or a Custom YAML-based environment.

Hope this helps.

Daniele-T
New Contributor

Hello @Ashwin_DSA ,

Thanks for your quick reply. My aim is to use it in a scheduled job (a Python script) deployed via the CLI.

When I tried what you suggested:

  environments:
    - environment_key: default
      spec:
        base_environment: databricks_ml_v5

with base_environment either quoted or unquoted, I get this error:

Error: cannot update job: Invalid base environment for 'default'. Only custom base environments (Workspace or Volume absolute paths ending with '.yaml' or '.yml') are currently supported.

I'm currently deploying via databricks cli=1.2.1

balajij8
Contributor III

You can define the environment directly in the DAB:

environments:
  - environment_key: default
    spec:
      environment_version: "5"
      dependencies:
        - xgboost
        # Add more libraries here

You can also create a YAML file listing the packages and reference it as the base environment in the DAB:

  • YAML file with the packages:

    environment_version: '5'
    dependencies:
      - xgboost>=2.0.0
      # Add more libraries here

  • Reference it as the base environment in the DAB:

    environments:
      - environment_key: default
        spec:
          base_environment: /Workspace/mlp/env.yaml