03-01-2023 11:14 PM
I'm on Unity Catalog.
I'm trying to do a dbt run on a project that works locally,
but the Databricks dbt workflow task seems to be ignoring the project.yml settings for schemas and catalogs, as well as those defined in the config block of individual models.
Is this because of that weird default dbt behaviour that prefixes custom schemas with the target schema?
* How do I "name" the target/profile generated by the job as 'prod' or 'uat' instead of 'databricks_cluster'? I can hack around this for now by overriding the generate_schema_name macro, but I do like to use different --target values to change behaviour, especially around external locations for different environments (dev/test/prod)
* env vars?
* where are all the dbt-generated assets? are they accessible?
dbt docs - where? Or do I need to do that manually (please no)
* generic tests?
03-12-2023 05:09 PM
@Jakub K : It's possible that the default behavior of dbt is causing the issue you're experiencing. By default, dbt prefixes any custom schema with the target schema name, which can cause confusion when defining schemas and catalogs in the project.yml file or the config block of individual models.
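That prefix comes from dbt's built-in generate_schema_name macro, which you already mentioned overriding. For reference, a minimal sketch of such an override (the widely used pattern of applying a custom schema verbatim; nothing here is specific to the Databricks dbt task):
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {# use the custom schema as-is instead of <target_schema>_<custom_schema> #}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}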
Option 1:
To avoid this behavior, you can set the schema parameter in your profiles.yml file to an empty string, like this:
my_profile:
  target: dev
  schema: ""
Option 2:
You can also try explicitly setting the schema and database parameters in your project.yml file and in the config block of individual models to ensure that they are being correctly applied. For example:
# project.yml
name: my_project
version: 1.0.0
config-version: 2

vars:
  database: my_database
  schema: my_schema

# model_name.sql
{{
  config(
    database=var('database'),
    schema=var('schema')
  )
}}

SELECT * FROM my_table;
If either of these options doesn't work, can you please review the dbt logs to see if there are any error messages or warnings that could provide more information?
03-12-2023 05:58 PM
Hi @Suteja Kanuri
I'm fine with configuring the profiles when I have full control of the VM which executes it.
My questions relate to using the databricks dbt job - https://docs.databricks.com/workflows/jobs/how-to-use-dbt-in-workflows.html
I think I figured out my 1st question - I can't change the databricks_cluster profile name; I have to hardcode it and hope Databricks doesn't change it.
env vars - still not sure where or how I'd set them in relation to the dbt_CLI cluster that the workflow job created automatically.
getting the assets - the output of the workflow job shows them being created in a randomised location:
15:15:24 Concurrency: 8 threads (target='databricks_cluster')
15:15:24
15:15:24 Done.
15:15:24 Building catalog
15:15:29 Catalog written to /tmp/tmp-dbt-run-45551/target/catalog.json
03-13-2023 12:14 AM
@Jakub K :
To set environment variables for your dbt_CLI cluster on Databricks, you can follow these steps:
Once you have set the environment variables for your dbt_CLI cluster, you can access them within your dbt project using dbt's built-in env_var() function (os.environ only applies inside Python code).
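For example, a minimal sketch of reading an environment variable in dbt project config (the variable name DBT_DEPLOYMENT_ENV is just an illustration):
# dbt_project.yml
vars:
  # falls back to 'dev' when the variable is not set on the cluster
  deployment_env: "{{ env_var('DBT_DEPLOYMENT_ENV', 'dev') }}"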
Hope this helps. Please let us know.
03-13-2023 12:17 AM
@Jakub K :
About the output: the workflow job shows that the catalog for the dbt project was written to a temporary location (/tmp/tmp-dbt-run-45551/target/catalog.json) during the run of the job. This location is typically a temporary directory created by the operating system or the application and is not guaranteed to be persistent or predictable.
It is recommended to write the output of the job to a persistent location such as a cloud storage service like Azure Blob Storage or AWS S3, rather than relying on temporary locations. This ensures that the output is accessible and reliable even after the job has completed.
03-12-2023 05:11 PM
name: my_project
version: 1.0.0
config-version: 2

# Set the name of the target profile
profiles:
  my_target_profile:
    target_name: prod  # or uat, or any name you want

# Set the target in your models
models:
  my_model:
    schema: my_schema
    database: my_database
    target: prod  # or uat, or any name you want
In this example, we have specified a target profile named my_target_profile with a target_name of 'prod'. We have also set the target parameter in the my_model model to 'prod'. You can change this to 'uat' or any other name you want, depending on your needs.
By setting the target parameter in your models, you can change the behavior of the models based on the environment you're running in.
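For what it's worth, the way environment-dependent behaviour is usually expressed in dbt is by selecting the target at run time (e.g. dbt run --target prod) and branching on target.name inside a model; a minimal sketch with placeholder model and column names:
-- models/my_model.sql
select *
from {{ ref('stg_events') }}
{% if target.name != 'prod' %}
-- keep dev/test runs cheap by only scanning recent data
where event_date >= date_sub(current_date(), 7)
{% endif %}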
03-12-2023 05:14 PM
In Databricks, the `target/` directory, where dbt stores its generated assets, is kept in the Databricks File System (DBFS). By default, this directory is located at `/dbfs/tmp/dbt/<target_name>`, where `<target_name>` is the name of the target profile you're running dbt against.
03-12-2023 06:54 PM
IMHO, the naming issue is likely caused by the way you've set up your profile. IIRC, when you set up dbt it typically creates a profile in ~/.dbt, which is outside the project directory. When you deploy to Databricks, this profile doesn't exist within the project directory, so it uses default settings. The solution is to create a profile inside your project directory that gets committed into your git repository (make sure you don't include your token in this profile). When you set up your dbt job, you can reference this profile with the --profiles-dir setting, e.g. dbt run --profiles-dir profiles/
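To make that concrete, here's a sketch of what such a committed profile could look like for the dbt-databricks adapter with two named targets; the host, http_path, catalog and schema values are placeholders, and the token is read from an environment variable so nothing sensitive is committed:
# profiles/profiles.yml
my_project:
  target: dev  # default; override with: dbt run --profiles-dir profiles/ --target prod
  outputs:
    dev:
      type: databricks
      catalog: dev_catalog
      schema: analytics
      host: <your-workspace>.cloud.databricks.com
      http_path: <your-cluster-or-warehouse-http-path>
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4
    prod:
      type: databricks
      catalog: prod_catalog
      schema: analytics
      host: <your-workspace>.cloud.databricks.com
      http_path: <your-cluster-or-warehouse-http-path>
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4
The top-level key (my_project here) has to match the profile: entry in dbt_project.yml.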
03-16-2023 09:55 PM
Hi @Jakub K
Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!
03-20-2023 08:55 PM
I don't think we're talking about the same thing
The dbt_CLI cluster is created automatically by the Databricks dbt job. It's not listed under my managed clusters. I have no visibility over how it's created, nor any properties or environment variables.
Do I need to create my own persisted dbt_job cluster?
04-01-2023 10:11 PM
@Jakub K :
If you do not have visibility or control over the automatically created dbt_CLI cluster in Databricks, and you need to set specific configurations or environment variables for your dbt run, then you may need to create your own persisted cluster.
A persisted cluster is a Databricks cluster that you create and manage, and it persists between dbt job runs. This allows you to have more control over the cluster configurations and environment variables. You can set up your own persisted cluster with the desired configurations and environment variables, and then specify that cluster in your dbt project's target configuration file.
To do this, first create a Databricks cluster with the desired configurations and environment variables. Then, in your dbt project's target configuration file, set the cluster field to the name of the persisted cluster you created. For example:
name: my_databricks_profile
target:
  type: databricks
  cluster: my_persisted_cluster
  platform: databricks
  region: <your_region>
  workspace_url: <your_workspace_url>
  org_id: <your_org_id>
  token: <your_token>
Make sure that the name of the persisted cluster in the cluster field matches the name of the cluster you created in Databricks. Once you have set up your persisted cluster and updated your dbt project's target configuration file, you can run your dbt job as usual and it will use the specified persisted cluster for execution.
04-05-2023 08:09 PM
OK thanks, I suspected so, but was hoping there was a way to configure this at the Databricks workflow job level for the dbt task.
03-21-2023 11:11 PM
Hi @Jakub K
I'm sorry you could not find a solution to your problem in the answers provided.
Our community strives to provide helpful and accurate information, but sometimes an immediate solution may not be available for every issue.
I suggest providing more information about your problem, such as specific error messages, error logs or details about the steps you have taken. This can help our community members better understand the issue and provide more targeted solutions.
Alternatively, you can consider contacting the support team for your product or service. They may be able to provide additional assistance or escalate the issue to the appropriate section for further investigation.
Thank you for your patience and understanding, and please let us know if there is anything else we can do to assist you.
03-22-2023 12:09 AM
Hi @Vidula Khanna
I'm referring to the new managed dbt job type added to databricks recently - https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/how-to/use-dbt-in-workflows
I'm curious about how it works behind the scenes and if there are any restrictions/show stoppers
I sometimes use conditional logic based on the target (dev/test/prod) - my first question is: can I specify a target for the job? My initial tests show "no" and that it's hardcoded to databricks_job. Is this correct?
Next, I want to get a copy of the generated docs from this Databricks job run - are there any suggested patterns? If I add a command to this Databricks job to "dbt docs generate", is there a way for me to get the files programmatically?
Environment variables - I didn't see a way to set them on the autogenerated job cluster. Can I use the managed job cluster with env vars, or do I need to create/manage a job cluster for dbt?