Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Work-around for cloning a repo with notebooks too big

dkxxx-rc
Contributor

Hi,

It would be nice to be able to clone https://github.com/shap/shap into Databricks, since it is such a standard library. But the clone fails because some of the notebooks exceed a Databricks size limit. Using sparse clone mode I can force my way through and get most of the repo, but is there another solution?

Thanks.

Accepted Solutions

Ashwin_DSA
Databricks Employee

Hi @dkxxx-rc,

One option besides sparse checkout is to use a Git folder with Git CLI access. The public docs explain that standard Git folders are constrained by per-operation limits, and for larger repos, Databricks recommends either sparse checkout or Git CLI commands.

More specifically, the docs say Git CLI-enabled folders can work with repositories that exceed the 2 GB memory and 4 GB disk limits of standard Git folders, so that's probably the next thing I'd suggest trying if sparse checkout is only a partial workaround. You can see that called out in "Create and manage Git folders" and in the Git folder limits page.

One caveat: you can't turn on Git CLI support for an existing Git folder, so this would need to be a fresh clone created with Git CLI access enabled.
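For anyone trying the fresh-clone route from a terminal, here is a minimal Python sketch of a blob-less, sparse clone driven through the Git CLI. It assumes a reasonably recent git (2.25 or later, for `--sparse` and `sparse-checkout set`) and a terminal where Git CLI access is available. The function names and the directory names in the usage note are purely illustrative and are not taken from the docs or checked against the shap repo layout.

```python
import subprocess


def build_sparse_clone_commands(repo_url, dest, keep_dirs):
    """Return the git commands for a partial, sparse clone limited to keep_dirs.

    Nothing is executed here; the commands come back as argument lists so
    they can be inspected before running.
    """
    return [
        # Blob-less partial clone; --sparse starts with only top-level files checked out.
        ["git", "clone", "--filter=blob:none", "--sparse", repo_url, dest],
        # Cone-mode sparse checkout: materialize only the listed directories.
        ["git", "-C", dest, "sparse-checkout", "set", *keep_dirs],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

A call might look like `run_commands(build_sparse_clone_commands("https://github.com/shap/shap", "shap", ["shap", "tests"]))`, where the two directory names are placeholders for whichever folders you actually need.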

If the underlying issue is specifically a few oversized notebooks or other large committed files, the docs also note that adding them to .gitignore won't shrink the repo once they're already in history. In that case, the more durable fix is to remove those files from history with something like git filter-repo, or split out the problematic content.
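To see which notebooks are the problem before touching history, a rough Python sketch like the one below can scan a local clone and build a matching git filter-repo call. The helper names are made up for illustration, and the size threshold is a parameter you would set from the Git folder limits page; no specific limit is assumed here. Rewriting history only makes sense on a fork or mirror you control, not on the upstream repo.

```python
from pathlib import Path


def find_oversized_files(root, limit_bytes, suffixes=(".ipynb",)):
    """Return (relative_path, size_in_bytes) for matching files larger than limit_bytes."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        # Skip git's own object store.
        if ".git" in rel.parts:
            continue
        if path.is_file() and path.suffix in suffixes:
            size = path.stat().st_size
            if size > limit_bytes:
                hits.append((rel.as_posix(), size))
    # Largest files first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


def build_filter_repo_command(paths):
    """Build a git filter-repo call that drops the given paths from all history."""
    cmd = ["git", "filter-repo", "--invert-paths"]
    for p in paths:
        cmd += ["--path", p]
    return cmd
```

The second helper only builds the argument list; git filter-repo is a separate tool that has to be installed, and it should be run on a fresh clone of your fork.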

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***


dkxxx-rc
Contributor

This looks sufficient, thanks.  I was able to run a clone in the CLI.  The folder ends up in an invalid git state, but that appears to be because the beta feature isn't turned on for my workspace yet.