
Can't run a job that uses GitHub as source

Kit
New Contributor III

I have a number of jobs that use code in GitHub as their source.

Everything worked fine until yesterday, when I noticed that all the jobs using GitHub as the source were failing with the following error:

```
Run result unavailable: job failed with error message
Checkout remote repository: INTERNAL_ERROR: Failed to checkout internal repo. This workspace already has 9253 repos which exceeds the max limit of 5000 repos
```

However, I checked the Repos folder in my workspace, and there are fewer than 100 repos. I have no idea why Databricks claims that I have 9k repos.

The number is now over 10k, and I didn't create hundreds of new repos in the last 24 hours.

I believe this is a Databricks issue. What should I do to resolve it?

Thanks,

FYI: I have now changed the notebook source to a local repo, and my jobs are running again.

1 ACCEPTED SOLUTION

Accepted Solutions

User16766737456
New Contributor III

Just an update, to round this out.

We investigated further internally and found that, although we have a cleanup process in place to remove the internal repos that are checked out for workflows, it was unable to keep up due to the sheer volume of jobs that were continuously failing during the repo checkout step (because of an invalid path).

This led to the limits being breached, and cascaded down to valid jobs not being able to launch.

We've worked with Kit to identify the errant job(s), and are now closely monitoring internal metrics, which currently show significant improvements.


7 REPLIES

User16766737456
New Contributor III

Hi, @Kit Yam Tse -- indeed, internally we count the number of repos in the workspace, and 9253 repos seems high. Can you use the Repos API to get the actual number? (You may need to use `next_page_token`.)

As an example, I use the following Python function to count the number of repos in my workspace. You can modify it to your needs.

    # Assumes `import time` and `import requests` at module level, and that
    # this is a method on a client class providing self.api_host,
    # self.session (a requests.Session), and self.api_headers.
    def call_endpoint(self, endpoint, response_key, params=None, pagination_key=None):
        url = f"https://{self.api_host}/{endpoint}"
        response_length = 0
        start_time = time.time()
        if pagination_key:
            if pagination_key == 'next_page_token':
                try:
                    response = self.session.get(url, headers=self.api_headers, params=params)
                    response_length = len(response.json()[response_key])
                    while 'next_page_token' in response.json():
                        params = {
                            'next_page_token': response.json()['next_page_token']
                        }
                        response = self.session.get(url, headers=self.api_headers, params=params)
                        response_length += len(response.json()[response_key])
                except requests.exceptions.RequestException:
                    pass
            elif pagination_key == 'offset':
                try:
                    response = self.session.get(url, headers=self.api_headers, params=params)
                    response_length = len(response.json()[response_key])
                    while response.json()['has_more']:
                        params['offset'] += 25
                        response = self.session.get(url, headers=self.api_headers, params=params)
                        response_length += len(response.json()[response_key])
                except requests.exceptions.RequestException:
                    pass
        else:
            try:
                response = self.session.get(url, headers=self.api_headers, params=params)
                response_length = len(response.json()[response_key]) if isinstance(response.json()[response_key], list) else \
                    response.json()[response_key]
            except requests.exceptions.RequestException:
                pass
        end_time = time.time()
        return {
            'endpoint': endpoint,
            'response_length': response_length,
            'response_time': end_time - start_time
        }

If the count is lower than the limit and you have a support contract, please file a support case so we can investigate further, as we may need more information from you.
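The core of the pagination loop above can be isolated into a small, testable helper. This is a minimal sketch, not Databricks code: `count_paginated` and `fake_fetch` are hypothetical names, and the fake fetcher only stands in for a real GET against the Repos endpoint.

```python
def count_paginated(fetch_page, response_key="repos"):
    """Count items across all pages of a next_page_token-style API.

    fetch_page(token) returns one page as a dict; each page carries the
    item list under response_key and, when more pages remain, a
    'next_page_token' entry. This mirrors the pagination loop above.
    """
    total = 0
    token = None
    while True:
        page = fetch_page(token)
        total += len(page.get(response_key, []))
        token = page.get("next_page_token")
        if not token:
            return total


# A fake fetcher standing in for a real GET on the Repos endpoint:
def fake_fetch(token):
    if token is None:
        return {"repos": [1, 2, 3], "next_page_token": "t1"}
    return {"repos": [4, 5]}

print(count_paginated(fake_fetch))  # 5
```

Injecting the fetcher keeps the counting logic separate from the HTTP client, so the loop can be verified without hitting the API.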

Kit
New Contributor III

Thanks Ian,

I only get the first page of the repos list. I can only recognise a few of them; the rest of the repos are under an internal path.

```
"repos": [
  {
    "id": {{ id }},
    "path": "/Repos/.internal/.alias/f/{{ some_values }}/{{ some_values }}",
    "url": "{{ url }}",
    "provider": "{{ provider }}",
    "head_commit_id": "{{ head_commit_id }}"
  },
  {
    "id": {{ id }},
    "path": "/Repos/{{ email }}/{{ repo_name }}",
    "url": "{{ url }}",
    "provider": "{{ provider }}",
    "branch": "{{ branch }}",
    "head_commit_id": "{{ head_commit_id }}"
  },
  {
    "id": {{ id }},
    "path": "/Repos/.internal/{{ some_values }}_commits/{{ head_commit_id }}",
    "url": "{{ url }}",
    "provider": "{{ provider }}",
    "head_commit_id": "{{ head_commit_id }}"
  },
```

I am using a git repo as the source for some scheduled jobs (which run every minute). Perhaps these internal repos are created by the scheduled jobs.

Unfortunately, I don't have a support contract yet.

Is there any way I can get help without one?
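To see how many of the listed repos are workflow-internal versus user-created, the paths returned by the Repos API can be tallied with a small helper. This is a minimal sketch: `tally_repo_paths` is a hypothetical name, and the `/Repos/.internal/` prefix is taken from the listing above.

```python
def tally_repo_paths(paths):
    """Split repo paths into workflow-internal and user-visible counts.

    Internal repos created for job checkouts appear under /Repos/.internal/,
    as seen in the API response above; everything else is a regular repo.
    """
    counts = {"internal": 0, "user": 0}
    for path in paths:
        if path.startswith("/Repos/.internal/"):
            counts["internal"] += 1
        else:
            counts["user"] += 1
    return counts


# Example with paths shaped like the redacted listing above:
sample = [
    "/Repos/.internal/.alias/f/abc/def",
    "/Repos/user@example.com/my-repo",
    "/Repos/.internal/abc_commits/0123456789abcdef",
]
print(tally_repo_paths(sample))  # {'internal': 2, 'user': 1}
```

Feeding every page of the listing through this (not just the first) would show whether the internal repos account for the 9k+ count.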

User16766737456
New Contributor III

Thanks, @Kit Yam Tse -- do you have the actual count (including the /Repos/.internal ones, which are, as you correctly note, for workflows)?

User16766737456
New Contributor III

Just to clarify: we count both internal repos (created by workflows, among others) and workspace repos toward the 5K limit. For workflows, when the repo count is exceeded, execution is initially blocked for 10 minutes until the count is reduced. There is also a cleanup process for finished tasks.

Does the job eventually get executed, or did it fail completely?

This is why it's important to get a complete repo count, so we can check whether this is the behaviour you are seeing.

Anonymous
Not applicable

It seems that I have a similar issue. Did you find a solution for this?

Hi, @Priscilla Maynard -- can you please send an email to help@databricks.com with more details? Thanks.


@Kit Yam Tse -- we are checking this internally and will keep you posted. Thanks for reporting this.
