Databricks Community

dkxxx-rc · 5 hours ago

Hi,

It would be nice to be able to clone https://github.com/shap/shap into Databricks, since it is such a standard library. But the clone fails because some of the notebooks exceed a Databricks size limit. Using sparse checkout mode I can force my way through and get most of the repo, but is there another solution?

Thanks.

Ashwin_DSA · 4 hours ago

Hi @dkxxx-rc,

One option besides sparse checkout is to use a Git folder with Git CLI access. The public docs explain that standard Git folders are constrained by per-operation limits, and for larger repos, Databricks recommends either sparse checkout or Git CLI commands.

More specifically, the docs say Git CLI-enabled folders can work with repositories that exceed the 2 GB memory and 4 GB disk limits of standard Git folders, so that’s probably the next thing I’d suggest trying if sparse checkout is only a partial workaround. You can see that called out in Create and manage Git folders and in the Git folder limits page.
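
If it helps to know which side of that limit a repo falls on, here is a minimal sketch that totals the on-disk size of a local clone and compares it with the 4 GB disk figure mentioned above. The function names are hypothetical, and reading "4 GB" as 4 GiB is an assumption, so adjust the constant if the docs mean something else.

```python
import os

# Disk figure quoted above for standard Git folders.
# Assumption: "4 GB" is treated as 4 GiB here.
DISK_LIMIT_BYTES = 4 * 1024**3


def repo_size_bytes(root):
    """Sum the sizes of all regular files under root (including .git)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link target is not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def exceeds_disk_limit(root, limit=DISK_LIMIT_BYTES):
    """Return True if the clone at root is larger than limit bytes."""
    return repo_size_bytes(root) > limit
```

This only measures a clone that already exists somewhere else (a laptop, for example). It says nothing about the memory used during a Git operation.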

One caveat: you can’t turn on Git CLI support for an existing Git folder, so this would need to be a fresh clone created with Git CLI access enabled.

If the underlying issue is specifically a few oversized notebooks or other large committed files, the docs also note that adding them to .gitignore won’t shrink the repo once they’re already in history. In that case, the more durable fix is to remove those files from history with something like git filter-repo, or split out the problematic content.
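
To find out which notebooks are the oversized ones before deciding what to strip or split out, something like the following sketch can be run against a local clone. `find_oversized` is a hypothetical helper, and the size threshold is a parameter you supply yourself, since the exact Databricks limit is not stated in this thread.

```python
from pathlib import Path


def find_oversized(root, max_bytes, suffixes=(".ipynb",)):
    """List (relative_path, size) for matching files larger than max_bytes.

    Only the checked-out working tree is scanned; anything under .git
    is skipped, so large blobs that exist only in history are not found.
    """
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if ".git" in rel.parts:
            continue
        if path.is_file() and path.suffix in suffixes:
            size = path.stat().st_size
            if size > max_bytes:
                hits.append((rel.as_posix(), size))
    # Largest files first.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

For example, `find_oversized("shap", 10 * 1024 * 1024)` would list notebooks over 10 MiB in a clone at `./shap`. That threshold is a placeholder, not the documented limit. The resulting paths are candidates for exclusion via sparse checkout or for removal from history.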

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

dkxxx-rc · 21m ago

This looks sufficient, thanks. I was able to run a clone in the CLI. The folder ends up in an invalid git state, but that appears to be because the beta feature isn't turned on for my workspace yet.

Work-around for cloning a repo with notebooks too big