The error you are seeing, "UNAUTHENTICATED: Invalid Git provider Personal Access Token credentials for repository URL", is a common pain point when integrating GitLab repos with Databricks using Personal Access Tokens (PATs), especially for scheduled jobs and automation. A PAT can work, but it is less stable than automation needs: tokens are tied to individual users, they expire, and they can become invalid for reasons other than expiry (for example, changes in permissions or security policies, or intermittent authentication issues).
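Since expiry is one of the likelier causes, a scheduled pre-flight check can flag a token before a job fails on it. The sketch below assumes GitLab's `GET /api/v4/personal_access_tokens/self` endpoint (available in recent GitLab versions) and its `active`, `revoked`, and `expires_at` response fields. The function names and the 7-day warning window are hypothetical choices for illustration, not anything Databricks or GitLab prescribes.

```python
import json
import urllib.request
from datetime import date


def token_problem(token_info, today, warn_days=7):
    """Return a reason string if the token looks risky, else None."""
    if token_info.get("revoked") or not token_info.get("active", True):
        return "token is revoked or inactive"
    expires_at = token_info.get("expires_at")  # ISO date string, or None
    if expires_at is None:
        return None  # token has no expiry date set
    days_left = (date.fromisoformat(expires_at) - today).days
    if days_left <= warn_days:
        return f"token expires in {days_left} day(s), on {expires_at}"
    return None


def fetch_token_info(gitlab_url, token):
    """Ask GitLab to describe the token that authenticates this request."""
    request = urllib.request.Request(
        f"{gitlab_url.rstrip('/')}/api/v4/personal_access_tokens/self",
        headers={"PRIVATE-TOKEN": token},
    )
    # An already-invalid token makes this call fail with HTTP 401,
    # which is itself the signal that the PAT needs replacing.
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `token_problem(fetch_token_info(url, pat), date.today())` from a small scheduled job would give some warning before the Databricks job hits the error.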
Alternatives to PATs for Databricks-GitLab Integration
Unfortunately, Databricks' more robust integration options (such as OAuth-based authentication and service principals) are currently available primarily for GitHub and Azure DevOps, not GitLab. OAuth 2.0 and service principal setups help decouple repo access from individual user accounts and avoid problems with expiring PATs, but as of now, GitLab support for these mechanisms in Databricks is limited or not natively available.
Best Practices & Workarounds
- PAT Management: Databricks currently only supports user-level PATs for GitLab (not project or group tokens), so you must manage these at the user level.
- CI/CD Approach: Many teams set up an automated GitLab runner or pipeline to sync code between GitLab and Databricks, using the Databricks CLI authenticated with environment variables like DATABRICKS_HOST and DATABRICKS_TOKEN. This keeps code sync outside Databricks' scheduled jobs and is more stable if your pipeline runner manages the tokens securely.
- Databricks CLI Unified Auth: For pipelined code sync, use environment variable authentication (setting DATABRICKS_HOST and DATABRICKS_TOKEN in your CI/CD environment) instead of storing PATs in Databricks "Linked Accounts." This is more reliable for automation.
- Re-clone on Failure: As a short-term fix, if you encounter repo authentication errors, deleting and re-cloning the repo in Databricks can reset its internal connection state.
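The CI/CD approach above can be sketched with the REST API instead of the CLI. This is a minimal example, assuming the pipeline has already checked out the GitLab repo and that DATABRICKS_HOST (including `https://`) and DATABRICKS_TOKEN are set as CI/CD variables. It pushes one file through the Workspace Import endpoint (`POST /api/2.0/workspace/import`), so no GitLab credential needs to be stored in Databricks. The helper names and the example paths are hypothetical.

```python
import base64
import json
import os
import urllib.request


def build_import_request(host, token, workspace_path, source_code):
    """Build a Workspace Import request that uploads a Python source file."""
    payload = {
        "path": workspace_path,
        "format": "SOURCE",
        "language": "PYTHON",
        # The API expects the file content base64-encoded.
        "content": base64.b64encode(source_code.encode("utf-8")).decode("ascii"),
        "overwrite": True,
    }
    return urllib.request.Request(
        f"{host.rstrip('/')}/api/2.0/workspace/import",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def push_file(local_path, workspace_path):
    """Upload one checked-out file, authenticating from CI/CD variables."""
    host = os.environ["DATABRICKS_HOST"]
    token = os.environ["DATABRICKS_TOKEN"]
    with open(local_path, encoding="utf-8") as handle:
        request = build_import_request(host, token, workspace_path, handle.read())
    with urllib.request.urlopen(request, timeout=30) as response:
        response.read()  # a non-2xx status raises urllib.error.HTTPError
```

A pipeline step could then call something like `push_file("jobs/etl.py", "/Shared/etl/etl")` for each file. The target workspace folder has to exist beforehand, and for whole directories the Databricks CLI is probably the more convenient tool.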
Recommended Approach
Given the current limitations:
- Automate via GitLab CI/CD: Run a job using the Databricks CLI or REST API to sync your code from GitLab to Databricks, using non-interactive tokens stored in your CI/CD environment as environment variables.
- Centralized Tokens: Create a dedicated "service account" user in GitLab and Databricks, and use that account's PAT for automation to reduce the chance of disruptions from individual user actions.
- Monitor Upstream Updates: Watch for future Databricks announcements regarding OAuth or service principal support for GitLab. This would enable a fully robust, PAT-free integration.
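If you go the service-account route and still keep a repo linked inside Databricks, the account's GitLab PAT has to be replaced there each time it is rotated. A rough sketch of that step is below, assuming the Databricks Git Credentials API (`PATCH /api/2.0/git-credentials/{credential_id}`) and the `gitLab` provider value. The function name, credential ID, and username in the usage note are placeholders, and the request must be authenticated as the service account that owns the credential.

```python
import json
import urllib.request


def build_credential_update_request(host, databricks_token, credential_id,
                                    gitlab_username, new_gitlab_pat):
    """Build a request that swaps in a freshly rotated GitLab PAT."""
    payload = {
        "git_provider": "gitLab",
        "git_username": gitlab_username,
        "personal_access_token": new_gitlab_pat,
    }
    return urllib.request.Request(
        f"{host.rstrip('/')}/api/2.0/git-credentials/{credential_id}",
        data=json.dumps(payload).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {databricks_token}",
            "Content-Type": "application/json",
        },
    )


def send(request):
    """Send the request; a non-2xx status raises urllib.error.HTTPError."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

For example, `send(build_credential_update_request(host, token, 42, "svc-databricks", new_pat))` could run from the same pipeline that rotates the token in GitLab. The credential ID can be looked up with `GET /api/2.0/git-credentials`.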
At this time, there is no built-in way to link a GitLab repo to Databricks for scheduled jobs without using a PAT or equivalent per-user credential. Using CI/CD for code sync and minimizing direct repo linking in Databricks jobs is the best practice for long-term stability.