Does "databricks bundle deploy" clean up old files?

xhead
New Contributor II

I'm looking at this page (Databricks Asset Bundles development work tasks) in the Databricks documentation.

When repo assets are deployed to a Databricks workspace, it is not clear whether "databricks bundle deploy" will remove files from the target workspace that are no longer in the source repo. For example, suppose a repo containing a notebook named "test1.py" had been deployed, then "test1.py" was removed from the repo and a new notebook "test2.py" was created. What will the target workspace contain after the next deploy? I believe it will contain both "test1.py" and "test2.py".

Secondly, the description of "databricks bundle destroy" does not indicate that it would remove all files from the workspace - only that it will remove all the artifacts referenced by the bundle. So when the "test1.py" file has been removed from the repo, and the "databricks bundle destroy" is run, will it only remove "test2.py" (which has not yet been deployed)?

I am trying to determine how to ensure that the shared workspace contains only the files that are in the repo: whatever I do in a release pipeline, the workspace should end up holding only the latest assets from the repo, and none of the old files that were previously in it.

The semantics of "databricks bundle deploy" (in particular the term "deploy") suggest to me that it should clean up assets in the target workspace as part of the deployment.

But if that is not the case, would running "databricks bundle destroy" prior to "databricks bundle deploy" adequately clean up the target workspace? Or do I need to use "databricks fs rm" to delete all the files in the target workspace folder before the bundle deploy?

1 ACCEPTED SOLUTION

Kaniz
Community Manager

Hi @xhead , 

When deploying repo assets to a Databricks workspace using the “databricks bundle deploy” command, it’s essential to understand how it interacts with existing files in the target workspace.

Let’s address your concerns:

  1. The behaviour of “databricks bundle deploy”:

    • Deploying a bundle does not automatically remove files from the target workspace that are no longer present in the source repo.
    • For example, if you initially deployed a notebook named “test1.py”, and later removed it from the repo while adding a new notebook “test2.py”, both “test1.py” and “test2.py” will coexist in the target workspace.
    • The deployment process is additive, not subtractive.
  2. “databricks bundle destroy”:

    • The purpose of “databricks bundle destroy” is to remove all previously-deployed jobs, pipelines, and artifacts that are defined in the bundle configuration files.
    • However, it does not remove other files in the workspace that were not part of the bundle.
    • So, if you run "databricks bundle destroy" after removing "test1.py" from the repo, it will remove only the previously deployed resources that the current bundle configuration still references. It will not touch "test2.py" (which has not yet been deployed), and a leftover file like "test1.py" that is no longer referenced is not guaranteed to be removed.
  3. Ensuring Workspace Consistency:

    • To ensure that your shared workspace contains only the latest assets from the repo, consider the following steps:
      • Manual Cleanup:
        • Before deploying a new bundle, manually delete any old files in the workspace that are no longer part of the repo.
      • Pre-Bundle Cleanup:
        • Run “databricks bundle destroy” before deploying a new bundle. This will remove previously-deployed artifacts.
        • Optionally, use “databricks fs rm” to delete specific files or folders in the target workspace.
      • Automated Pipeline:
        • In your release pipeline, consider scripting the cleanup process.
        • Compare the files in the repo with those in the workspace and remove any discrepancies.
  4. Semantic Implications:

    • While the term “deploy” might imply cleanup, it focuses on adding or updating resources rather than removing existing ones.
    • The responsibility for workspace consistency lies with the user.

Remember to tailor your approach based on your specific requirements and workflow. Happy bundling!
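The "compare the files in the repo with those in the workspace" step from the Automated Pipeline suggestion can be sketched in Python. This is a minimal dry-run sketch under stated assumptions: WS_ROOT and the file lists are hypothetical placeholders (a real pipeline would list the repo via git and the workspace via "databricks workspace list"), and the generated CLI commands are printed rather than executed.

```python
# Dry-run sketch of the cleanup-comparison step.
# WS_ROOT is a hypothetical target folder; adjust to your bundle's workspace path.
WS_ROOT = "/Workspace/Shared/my_bundle"

def stale_paths(repo_files, workspace_files):
    """Return workspace-relative paths that are no longer present in the repo."""
    return sorted(set(workspace_files) - set(repo_files))

def delete_commands(repo_files, workspace_files, ws_root=WS_ROOT):
    """Build (but do not run) `databricks workspace delete` commands for stale files."""
    return [
        f"databricks workspace delete {ws_root}/{path}"
        for path in stale_paths(repo_files, workspace_files)
    ]

# Example from this thread: test1.py was removed from the repo, test2.py was added.
repo = ["test2.py"]
workspace = ["test1.py", "test2.py"]

for cmd in delete_commands(repo, workspace):
    print(cmd)
# prints: databricks workspace delete /Workspace/Shared/my_bundle/test1.py
```

Printing the commands first (instead of running them) lets you review the deletions in the pipeline log before wiring them up to actually execute against a shared workspace.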


2 REPLIES


xhead
New Contributor II

One further question:

  • The purpose of “databricks bundle destroy” is to remove all previously-deployed jobs, pipelines, and artifacts that are defined in the bundle configuration files.

Which bundle configuration files? The ones in the repo? Or are there bundle configuration files in the target workspace location that are used? If the previous version of the bundle contained a reference to test1.py and it has been deployed to a shared workspace, and the new version of the repo no longer contains test1.py, will the destroy command remove test1.py from the shared workspace? 

 
