Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

GitLab Integration

flodoamaral
New Contributor

Hello 👋

I'm struggling with GitLab integration in Databricks.
I've got jobs that run on a daily basis, pointing directly to .py files in my repo. To make that work, my GitLab account is linked to Databricks with a PAT that expires within a month.

But every other day (at least once a week), I get the same error when scheduled jobs run:
Failed to checkout Git repository: UNAUTHENTICATED: Invalid Git provider Personal Access Token credentials for repository URL.

This happens even though the PAT is not expired.
Usually I simply renew my PAT and update my linked account in Databricks settings.

Is there a better way to link my repo to our Databricks instance?
Ideally it wouldn't go through a PAT, since that is tied to a single user; I'd like something more stable.

Thanks in advance for your help, 
Florian

1 REPLY

mark_ott
Databricks Employee

The error you are experiencing, "UNAUTHENTICATED: Invalid Git provider Personal Access Token credentials for repository URL", is a common pain point when integrating GitLab repos with Databricks using Personal Access Tokens (PATs), especially for scheduled jobs and automation. While using a PAT can work, it's not as stable as you need because tokens are tied to users, expire, and occasionally become invalid for reasons beyond just manual expiry (e.g., changes in permissions, security policies, or intermittent authentication issues).

Alternatives to PATs for Databricks-GitLab Integration

Unfortunately, Databricks' more robust integration options (such as OAuth-based authentication and service principals) are currently available primarily for GitHub and Azure DevOps, not GitLab. OAuth 2.0 and service principal setups help decouple repo access from individual user accounts and avoid problems with expiring PATs, but as of now, GitLab support for these mechanisms in Databricks is limited or not natively available.

Best Practices & Workarounds

  • PAT Management: Databricks currently only supports user-level PATs for GitLab (not project or group tokens), so you must manage these at the user level.

  • CI/CD Approach: Many teams set up an automated GitLab runner or pipeline to sync code between GitLab and Databricks, using the Databricks CLI authenticated with environment variables like DATABRICKS_HOST and DATABRICKS_TOKEN. This keeps code sync outside Databricks' scheduled jobs and is more stable if your pipeline runner manages the tokens securely (see the first sketch after this list).

  • Databricks CLI Unified Auth: For pipelined code sync, use environment variable authentication (setting DATABRICKS_HOST and DATABRICKS_TOKEN in your CI/CD environment) instead of storing PATs in Databricks "Linked Accounts." This is more reliable for automation.

  • Re-clone on Failure: As a short-term fix, if you encounter repo authentication errors, deleting and re-cloning the repo in Databricks can reset its internal connection state (the second sketch after this list shows the same reset done via the REST API).
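
To make the CI/CD bullets above concrete, here is a minimal sketch of a script a GitLab pipeline job could run, assuming the environment-variable authentication described above and the documented Databricks Repos REST API (PATCH /api/2.0/repos/{repo_id}). DATABRICKS_REPO_ID is a hypothetical CI variable you would define yourself; CI_COMMIT_BRANCH is GitLab CI's predefined branch variable.

```python
import os

import requests

# Injected by the CI/CD environment (e.g., masked GitLab CI/CD variables).
HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]  # ideally a service-account token, not a personal one

REPO_ID = os.environ["DATABRICKS_REPO_ID"]           # hypothetical: ID of the Databricks repo
BRANCH = os.environ.get("CI_COMMIT_BRANCH", "main")  # GitLab CI's predefined branch variable


def sync_repo() -> None:
    """Fast-forward the Databricks repo to the latest commit on BRANCH."""
    resp = requests.patch(
        f"{HOST}/api/2.0/repos/{REPO_ID}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"branch": BRANCH},
        timeout=30,
    )
    resp.raise_for_status()
    print(f"Repo {REPO_ID} now tracks the latest commit on {BRANCH}")


if __name__ == "__main__":
    sync_repo()
```

Because the token lives only in the CI environment, renewing it means updating one CI/CD variable instead of relinking every user's account in Databricks settings.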
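
For the re-clone workaround, the same reset can be scripted instead of done by hand in the UI. A hedged sketch, assuming the documented Repos list/delete/create endpoints; the workspace path and Git URL below are placeholders:

```python
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

REPO_PATH = "/Repos/automation/my-project"              # placeholder workspace path
GIT_URL = "https://gitlab.com/my-group/my-project.git"  # placeholder repo URL


def reclone() -> None:
    """Delete the repo in Databricks and clone it again to reset its auth state."""
    listing = requests.get(f"{HOST}/api/2.0/repos", headers=HEADERS, timeout=30)
    listing.raise_for_status()
    # Note: ignores pagination; fine for workspaces with few repos.
    for repo in listing.json().get("repos", []):
        if repo.get("path") == REPO_PATH:
            deleted = requests.delete(
                f"{HOST}/api/2.0/repos/{repo['id']}", headers=HEADERS, timeout=30
            )
            deleted.raise_for_status()
    created = requests.post(
        f"{HOST}/api/2.0/repos",
        headers=HEADERS,
        json={"url": GIT_URL, "provider": "gitLab", "path": REPO_PATH},
        timeout=30,
    )
    created.raise_for_status()
    print(f"Re-cloned {GIT_URL} at {REPO_PATH}")


if __name__ == "__main__":
    reclone()
```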

Recommended Approach

Given the current limitations:

  1. Automate via GitLab CI/CD: Run a job using the Databricks CLI or REST API to sync your code from GitLab to Databricks, using non-interactive tokens stored in your CI/CD environment as environment variables.

  2. Centralized Tokens: Create a dedicated "service account" user in GitLab and Databricks, and use that account's PAT for automation to reduce the chance of disruptions from individual user actions (a credential-rotation sketch follows this list).

  3. Monitor Upstream Updates: Watch for future Databricks announcements regarding OAuth or service principal support for GitLab; this would enable a fully robust, PAT-free integration.
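
If you adopt the service-account route from step 2, the GitLab credential that Databricks stores for that account can be rotated programmatically rather than through the settings UI. A minimal sketch, assuming the Databricks Git Credentials REST API (/api/2.0/git-credentials) and its documented request shape; the username and the NEW_GITLAB_PAT variable are hypothetical placeholders:

```python
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

NEW_GITLAB_PAT = os.environ["NEW_GITLAB_PAT"]  # hypothetical: the freshly issued GitLab token
GITLAB_USERNAME = "automation-bot"             # hypothetical service-account username


def rotate_git_credential() -> None:
    """Replace the stored GitLab credential so scheduled jobs keep authenticating."""
    resp = requests.get(f"{HOST}/api/2.0/git-credentials", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    creds = resp.json().get("credentials", [])

    payload = {
        "git_provider": "gitLab",
        "git_username": GITLAB_USERNAME,
        "personal_access_token": NEW_GITLAB_PAT,
    }
    if creds:
        # Databricks stores one Git credential per user; update it in place.
        cred_id = creds[0]["credential_id"]
        result = requests.patch(
            f"{HOST}/api/2.0/git-credentials/{cred_id}",
            headers=HEADERS, json=payload, timeout=30,
        )
    else:
        result = requests.post(
            f"{HOST}/api/2.0/git-credentials",
            headers=HEADERS, json=payload, timeout=30,
        )
    result.raise_for_status()


if __name__ == "__main__":
    rotate_git_credential()
```

Run on a schedule shortly before the GitLab PAT expires, this removes the manual renew-and-relink step that currently breaks your scheduled jobs.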

At this time, there is no built-in way to link a GitLab repo to Databricks for scheduled jobs without using a PAT or equivalent per-user credential. Using CI/CD for code sync and minimizing direct repo linking in Databricks jobs is the best practice for long-term stability.