Data Engineering

Git integration inconsistencies between git folders and job git

korijn
New Contributor II

It's a little confusing and limiting that the git integration support is inconsistent between the two options available.

Sparse checkout is only supported when using a workspace Git folder, and checking out by commit hash is only supported when using a remote Git source for a job. I want to check out by commit hash, and use sparse checkout!

When using a workspace Git folder to check out a branch, there is a real risk that you do not get the version of the code you intended to deploy. Imagine a CI/CD scenario where you have merged your changes to master and are now running a pipeline to deploy your code to Databricks. As part of the deploy, a workspace Git folder is updated to pull the latest commit from the master branch. While the deploy is running, another pull request is merged into master. Now the code on Databricks differs from the code you intended to deploy. I want to avoid this risk by checking out by commit hash.
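To make the race concrete: the pipeline can resolve the commit once, at the start, and pass that exact hash to every later step instead of the branch name. A minimal sketch of that idea follows. The environment variable name `DEPLOY_COMMIT_SHA` and the function `pinned_commit` are hypothetical, chosen only for illustration; most CI systems expose the triggering commit under their own variable name.

```python
import re

# A full SHA-1 commit hash is 40 hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_commit(env, var="DEPLOY_COMMIT_SHA"):
    """Return the commit hash the pipeline was started for.

    `env` is a mapping such as os.environ. `var` is a hypothetical
    variable name; substitute whatever your CI system provides.
    Branch names such as "master" are rejected on purpose, because a
    branch can move while the deploy is still running.
    """
    sha = env.get(var, "").strip().lower()
    if not _FULL_SHA.fullmatch(sha):
        raise ValueError(f"{var} must be a full 40-character commit hash, got {sha!r}")
    return sha
```

Called as `pinned_commit(os.environ)` at the start of the deploy, this fails fast when only a branch name is available, rather than silently deploying whatever the branch points to later.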

As a separate question, I'm curious whether your git checkout uses a shallow clone (no history) or a full-history clone.

2 REPLIES

Mounika_Tarigop
Databricks Employee

Sparse Checkout: This feature is only supported when using a workspace Git folder. Sparse checkout allows you to clone and work with only a subset of the remote repository's directories, which is useful for managing large repositories.

Checking Out by Commit Hash: This feature is only supported when using a remote Git source for a job. Checking out by commit hash ensures that you are working with a specific version of the code, which is crucial for maintaining consistency, especially in CI/CD scenarios.
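As a rough illustration of pinning a job to a commit, the sketch below builds a job settings payload with a `git_source` block. It assumes the Jobs API field names `git_url`, `git_provider`, and `git_commit`, and a notebook task with `source` set to `"GIT"`; check these against the current Jobs API reference before relying on them. The function name, the task key, and the example values are hypothetical, and compute settings are omitted for brevity.

```python
def job_settings_for_commit(job_name, git_url, git_provider, commit_sha, notebook_path):
    """Build a job settings dict that pins the job's code to one commit.

    Assumption: field names follow the Databricks Jobs API as the author
    of this sketch understands it. Nothing here calls the API; the dict
    would be sent as the JSON body of a job create or update request.
    """
    return {
        "name": job_name,
        "git_source": {
            "git_url": git_url,
            "git_provider": git_provider,
            # Pin to an exact commit rather than a branch or tag.
            "git_commit": commit_sha,
        },
        "tasks": [
            {
                "task_key": "main",  # hypothetical task key
                "notebook_task": {
                    # Path relative to the repository root.
                    "notebook_path": notebook_path,
                    "source": "GIT",
                },
            }
        ],
    }
```

Because the payload carries the hash itself, a later merge into the branch does not change which code the job runs.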

Unfortunately, due to the current limitations, you cannot combine sparse checkout with checking out by commit hash directly within the Databricks workspace Git folder.

To mitigate this risk, you might consider the following workarounds:

  • Use Remote Git Source for Jobs: Configure your jobs to use a remote Git source and specify the commit hash you want to check out. This ensures that the exact version of the code is used during deployment.
  • Manual Sparse Checkout: Perform sparse checkout operations manually outside of Databricks and then push the relevant subset of the repository to a new branch or repository that Databricks can then use.
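For the manual route, the git side of the second workaround could look roughly like the sketch below. It only builds the command lines for a sparse checkout at a fixed commit and does not run them. The function name and all example arguments are hypothetical, and it assumes a Git version that ships the `sparse-checkout` command and a commit that is reachable from a ref fetched by the clone. Pushing the resulting subset to a new branch or repository is left out.

```python
def sparse_checkout_commands(repo_url, dest, commit_sha, directories):
    """Return git commands (as argv lists) for a sparse checkout at a commit.

    Steps: clone without populating the working tree, restrict the
    working tree to the given directories, then check out the commit.
    """
    if not directories:
        raise ValueError("at least one directory is required")
    return [
        # --no-checkout leaves the working tree empty after cloning;
        # --filter=blob:none defers downloading file contents until needed.
        ["git", "clone", "--no-checkout", "--filter=blob:none", repo_url, dest],
        # Limit the working tree to the listed directories.
        ["git", "-C", dest, "sparse-checkout", "set", *directories],
        # Check out the exact commit (results in a detached HEAD).
        ["git", "-C", dest, "checkout", commit_sha],
    ]
```

Each entry can be passed to `subprocess.run(cmd, check=True)` in order, so a failing step stops the sequence.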

Thank you for confirming that the information in the opening post is accurate. The workarounds are not acceptable, so feel free to close this issue, or leave it open until a more mature solution is released on the Databricks platform.
