cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

dbt workflow job limitations - naming the target? where do docs go?

jakubk
Contributor

I'm on unity catalog

I'm trying to do a dbt run on a project that works locally

but the databricks dbt workflow task seems to be ignoring the project.yml settings for schemas and catalogs, as well as that defined in the config block of individual models

Is this because of that weird default dbt behaviour that prefixes the default schema with the target schema?

* How do I "name" the target/profile generated by the job as 'prod' or 'uat' instead of 'databricks_cluster'? I can hack around this for now overriding the generate_schema macro, but i do like to use different -targets to change behaviour esp around external locations for different environments (dev/test/prod)

* env vars?

* where are all the dbt generated assets? are they accessible?

dbt docs - where? or do i need to do that manually (please no)

* generic tests?

13 REPLIES 13

Anonymous
Not applicable

@Jakub K​ : It's possible that the default behavior of dbt is causing the issue you're experiencing. By default, dbt prefixes the default schema with the target schema name, which can cause confusion when defining schemas and catalogs in the project.yml file or the config block of individual models.

Option 1:

To avoid this behavior, you can set the schema parameter in your profiles.yml file to an empty string, like this:

my_profile:
  target: dev
  schema: ""

Option 2:

You can also try explicitly setting the schema and database parameters in your project.yml file and in the config block of individual models to ensure that they are being correctly applied. For example:

# project.yml
name: my_project
version: 1.0.0
config-version: 2
 
vars:
  database: my_database
  schema: my_schema
 
# model_name.sql
{{
  config(
    database=var('database'),
    schema=var('schema')
  )
}}
 
SELECT * FROM my_table;

If either of these options dont work, can you please review the dbt logs to see if there are any error messages or warnings that could provide more information

Hi @Suteja Kanuri​ 

I'm fine with configuring the profiles when I have full control of the vm which executes it

My questions relate to using the databricks dbt job - https://docs.databricks.com/workflows/jobs/how-to-use-dbt-in-workflows.html

I think I figured out my 1st question - i can't change the databricks_cluster profile name, i have to hardcode it and hope databricks don't change it

env vars - still not sure where or how I'd set them in relation to the dbt_CLI cluster that the workflow job created automatically

getting the assets - the output of the workflow job shows it being created in a randomised location

15:15:24 Concurrency: 8 threads (target='databricks_cluster') 15:15:24 15:15:24 Done. 15:15:24 Building catalog 15:15:29 Catalog written to /tmp/tmp-dbt-run-45551/target/catalog.json

Anonymous
Not applicable

@Jakub K​ :

To set environment variables for your dbt_CLI cluster on Databricks, you can follow these steps:

  1. Navigate to your Databricks workspace and click on the "Clusters" tab.
  2. Click on the name of the dbt_CLI cluster that you want to set environment variables for.
  3. Click on the "Edit" button to open the cluster configuration page.
  4. Scroll down to the "Advanced Options" section and click on "Spark" to expand the settings.
  5. In the "Environment Variables" section, click on "Add" to add a new environment variable.
  6. Enter the name and value for the environment variable that you want to set, and click on "OK" to save the changes.
  7. Repeat steps 5 and 6 to add additional environment variables, if necessary.
  8. Click on "Save" to save the changes to the cluster configuration

Once you have set the environment variables for your dbt_CLI cluster, you can access them within your dbt project using the os.environ Python module.

Hope this helps. Please let us know.

Anonymous
Not applicable

@Jakub K​ :

About the error: The output of the workflow job shows that the catalog for the dbt project was written to a temporary location ( /tmp/tmp-dbt-run 45551/target/catalog.json) during the run of the job. This location is typically a temporary directory created by the operating system or the application and is not guaranteed to be persistent or predictable.

It is recommended to write the output of the job to a persistent location such as a cloud storage service like Azure Blob Storage or AWS S3, rather than relying on temporary locations. This ensures that the output is accessible and reliable even after the job has completed.

Anonymous
Not applicable
name: my_project
version: 1.0.0
config-version: 2
 
# Set the name of the target profile
profiles:
  my_target_profile:
    target_name: prod  # or uat, or any name you want
 
# Set the target in your models
models:
  my_model:
    schema: my_schema
    database: my_database
    target: prod  # or uat, or any name you want

In this example, we have specified a target profile named my_target_profile with a target_name of 'prod'. We have also set the target parameter in the my_model model to 'prod'. You can change this to 'uat' or any other name you want, depending on your needs.

By setting the target parameter in your models, you can change the behavior of the models based on the environment you're running in.

Anonymous
Not applicable

In Databricks, the `target/` directory is where the dbt generates assets is stored in the Databricks File System (DBFS). By default, this directory is located at `/dbfs/tmp/dbt/<target_name>` , where `<target_name>` is the name of the target profile you're running dbt against.

najmead
Contributor

IMHO, the naming issue is likely caused by the way you've set up your profile. IIRC, typically when you setup dbt, it creates a profile in ~/.dbt, which is outside the project directory. When you deploy to Databricks, this profile doesn't exist within the project directory, so it uses default settings. The solution is to create a profile inside your project directory that gets committed into your git repository (make sure you don't include your token in this profile). When you setup your dbt job, you can reference this profile with the --profiles-dir setting, eg dbt run --profiles-dir profiles/

Anonymous
Not applicable

Hi @Jakub K​ 

Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. 

We'd love to hear from you.

Thanks!

I don't think we're talking about the same thing

the dbt_CLI cluster is created automatically by the databricks dbt job. It's not listed under my managed clusters. I have no visibility over how its created nor any properties or environment variables

Do I need to create my own persisted dbt_job cluster?

Anonymous
Not applicable

@Jakub K​ :

If you do not have visibility or control over the automatically created dbt_CLI cluster in Databricks, and you need to set specific configurations or environment variables for your dbt run, then you may need to create your own persisted cluster.

A persisted cluster is a Databricks cluster that you create and manage, and it persists between dbt job runs. This allows you to have more control over the cluster configurations and environment variables. You can set up your own persisted cluster with the desired configurations and environment variables, and then specify that cluster in your dbt project's target configuration file.

To do this, first create a Databricks cluster with the desired configurations and environment variables. Then, in your dbt project's target configuration file, set the cluster field to the name of the persisted cluster you created. For example:

name: my_databricks_profile
target:
  type: databricks
  cluster: my_persisted_cluster
  platform: databricks
  region: <your_region>
  workspace_url: <your_workspace_url>
  org_id: <your_org_id>
  token: <your_token>

Make sure that the name of the persisted cluster in the cluster field matches the name of the cluster you created in Databricks. Once you have set up your persisted cluster and updated your dbt project's target configuration file, you can run your dbt job as usual and it will use the specified persisted cluster for execution.

Ok thanks, I suspected so but was hoping there was a way to configure this at the databricks workflow job level for the dbt task

Anonymous
Not applicable

Hi @Jakub K​ 

I'm sorry you could not find a solution to your problem in the answers provided.

Our community strives to provide helpful and accurate information, but sometimes an immediate solution may only be available for some issues.

I suggest providing more information about your problem, such as specific error messages, error logs or details about the steps you have taken. This can help our community members better understand the issue and provide more targeted solutions.

Alternatively, you can consider contacting the support team for your product or service. They may be able to provide additional assistance or escalate the issue to the appropriate section for further investigation.

Thank you for your patience and understanding, and please let us know if there is anything else we can do to assist you.

Hi @Vidula Khanna​ 

I'm referring to the new managed dbt job type added to databricks recently - https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/how-to/use-dbt-in-workflows

I'm curious about how it works behind the scenes and if there are any restrictions/show stoppers

I sometimes use conditional logic which is based on the target (dev/test/prod) - my first question is can I specify a target for the job? My initial tests show "no" and that its hardcoded to databricks_job. Is this correct?

next, i want to get a copy of the generated docs from this databricks job run - are there any suggested patterns? If I add a command to this databricks job to "dbt docs generate" is there a way for me to get the files programmatically?

environment variables. I didn't see a way to set them onto the autogenerated job cluster. Can I use the managed job cluster with env vars or do I need to create/manage a job cluster for dbt?

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!