Databricks Community

korijn · ‎11-28-2024

It's a little confusing and limiting that Git integration support is inconsistent between the two available options.

Sparse checkout is only supported when using a workspace Git folder, and checking out by commit hash is only supported when using a remote Git source for a job. I want to check out by commit hash, and use sparse checkout!

When using a workspace Git folder to check out a branch, there is a real risk of deploying a different version of the code than you intended. Imagine a CI/CD scenario where you have merged your changes to master and are now running a pipeline to deploy your code to Databricks. As part of the deploy, a workspace Git folder is updated to pull the latest commit from the master branch. If another pull request is merged into master while the deploy is running, the Git folder picks up that newer commit instead of the one the pipeline was started for. I want to avoid this risk by checking out by commit hash.

As a separate question, I'm curious whether your Git checkout uses a shallow clone (no history) or a full-history clone.

Mounika_Tarigop · ‎12-06-2024

Sparse Checkout: This feature is only supported when using a workspace Git folder. Sparse checkout allows you to clone and work with only a subset of the remote repository's directories, which is useful for managing large repositories.

Checking Out by Commit Hash: This feature is only supported when using a remote Git source for a job. Checking out by commit hash ensures that you are working with a specific version of the code, which is crucial for maintaining consistency, especially in CI/CD scenarios.
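To make the commit-pinning idea concrete, here is a minimal sketch of job settings that reference a remote Git source by commit hash. The field names (`git_source`, `git_url`, `git_provider`, `git_commit`, and `"source": "GIT"` on the task) reflect my understanding of the Jobs API and should be checked against the current documentation. The function name, the `main` task key, and the full 40-character hash check are illustrative assumptions, and compute settings are left out for brevity.

```python
import re

# Assumption for this sketch: only full 40-character SHA-1 hashes are accepted,
# so a branch name cannot slip in where a pinned commit is expected.
_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def build_job_settings(job_name, git_url, git_provider, commit_sha, notebook_path):
    """Build job settings that pin the job's code to one exact commit.

    Hypothetical helper: it only assembles a dictionary and does not call
    any Databricks API. Compute configuration is intentionally omitted.
    """
    if not _FULL_SHA.match(commit_sha):
        raise ValueError(f"expected a full commit hash, got {commit_sha!r}")
    return {
        "name": job_name,
        "git_source": {
            "git_url": git_url,
            "git_provider": git_provider,
            # Pin to a commit rather than a branch, so a merge that lands
            # during the deploy cannot change what the job runs.
            "git_commit": commit_sha,
        },
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {
                    # Path is relative to the repository root.
                    "notebook_path": notebook_path,
                    "source": "GIT",
                },
            }
        ],
    }
```

In a CI/CD pipeline, the hash would typically be resolved once at the start of the run (for example from the CI system's commit variable) and passed through unchanged, so that every later step refers to the same commit.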

Unfortunately, because of these limitations, you currently cannot combine sparse checkout with checking out by commit hash: workspace Git folders do not support checking out by commit hash, and remote Git sources for jobs do not support sparse checkout.

To mitigate this risk, you might consider the following workarounds:

1. Use Remote Git Source for Jobs: Configure your jobs to use a remote Git source and specify the commit hash you want to check out. This ensures that the exact version of the code is used during deployment.
2. Manual Sparse Checkout: Perform sparse checkout operations manually outside of Databricks and then push the relevant subset of the repository to a new branch or repository that Databricks can then use.
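For the manual sparse checkout, the Git side could look roughly like the sketch below, run on a CI runner or workstation outside Databricks. It uses standard Git commands (`git clone --no-checkout --filter=blob:none`, `git sparse-checkout set`, `git checkout <hash>`). The helper names, the `repo` working directory, and the choice of a blobless partial clone are assumptions for illustration. Pushing the resulting subset to a separate branch or repository is not shown.

```python
import subprocess


def sparse_checkout_commands(repo_url, commit_sha, paths, workdir="repo"):
    """Return the git commands for a sparse checkout of one exact commit.

    Hypothetical helper: it only builds the command lists, so they can be
    inspected or logged before anything is executed.
    """
    if not paths:
        raise ValueError("at least one directory is required for sparse checkout")
    return [
        # Clone without populating the working tree; blobs are fetched on demand.
        ["git", "clone", "--no-checkout", "--filter=blob:none", repo_url, workdir],
        # Restrict the working tree to the requested directories.
        ["git", "-C", workdir, "sparse-checkout", "set", *paths],
        # Check out the pinned commit (detached HEAD), not a moving branch tip.
        ["git", "-C", workdir, "checkout", commit_sha],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(sparse_checkout_commands(url, sha, ["jobs/etl"]))` would leave only the listed directories (plus top-level files, when Git's cone mode is active) in the working tree at the pinned commit. Whether cone mode is the default depends on the installed Git version.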

korijn · ‎01-08-2025

Thank you for confirming that the information in the opening post is accurate. The workarounds are not acceptable, so feel free to close this issue, or leave it open until a more mature solution is released on the Databricks platform.

Git integration inconsistencies between git folders and job git
