
Can't run a job that uses GitHub as source

Kit
New Contributor III

I have a number of jobs that use code in GitHub as their source.

Everything worked fine until yesterday, when I noticed that all the jobs using GitHub as the source were failing with the following error:

```
Run result unavailable: job failed with error message
Checkout remote repository: INTERNAL_ERROR: Failed to checkout internal repo. This workspace already has 9253 repos which exceeds the max limit of 5000 repos
```

However, I checked the Repos folder in my workspace, and there are fewer than 100 repos. I have no idea why Databricks claims that I have 9k repos.

The number is now over 10k, and I didn't create hundreds of new repos in the last 24 hours.

I believe this is a Databricks issue. What should I do to resolve it?

Thanks,

FYI: I have now changed the notebook source to a local repo, and my jobs are running again.

1 ACCEPTED SOLUTION

Accepted Solutions

User16766737456
New Contributor III

Just an update, to round this out.

We investigated further internally and found that, although we have a cleanup process in place to remove the internal repos that are checked out for workflows, it was unable to keep up due to the sheer volume of jobs that were continuously failing during the repo checkout step (because of an invalid path).

This led to the limits being breached, and cascaded down to valid jobs not being able to launch.

We've worked with Kit to identify the errant job(s), and are now closely monitoring internal metrics, which currently show significant improvements.


7 REPLIES

User16766737456
New Contributor III

Hi, @Kit Yam Tse -- indeed, internally we count the number of repos in the workspace, and 9253 repos seems high. Can you use the Repos API to get the actual number? (You may need to use `next_page_token`.)

As an example, I use the following Python function to count the number of repos in my workspace. You can modify it to your needs.

    # Assumes `import time` and `import requests` at module level, and that
    # this is a method on a client class providing self.api_host,
    # self.session (a requests.Session), and self.api_headers.
    def call_endpoint(self, endpoint, response_key, params=None, pagination_key=None):
        url = f"https://{self.api_host}/{endpoint}"
        response_length = 0
        start_time = time.time()
        if pagination_key:
            if pagination_key == 'next_page_token':
                try:
                    response = self.session.get(url, headers=self.api_headers, params=params)
                    response_length = len(response.json()[response_key])
                    while 'next_page_token' in response.json():
                        params = {
                            'next_page_token': response.json()['next_page_token']
                        }
                        response = self.session.get(url, headers=self.api_headers, params=params)
                        response_length += len(response.json()[response_key])
                except requests.exceptions.RequestException:
                    pass
            elif pagination_key == 'offset':
                try:
                    response = self.session.get(url, headers=self.api_headers, params=params)
                    response_length = len(response.json()[response_key])
                    while response.json()['has_more']:
                        params['offset'] += 25
                        response = self.session.get(url, headers=self.api_headers, params=params)
                        response_length += len(response.json()[response_key])
                except requests.exceptions.RequestException:
                    pass
        else:
            try:
                response = self.session.get(url, headers=self.api_headers, params=params)
                response_length = len(response.json()[response_key]) if isinstance(response.json()[response_key], list) else \
                    response.json()[response_key]
            except requests.exceptions.RequestException:
                pass
        end_time = time.time()
        return {
            'endpoint': endpoint,
            'response_length': response_length,
            'response_time': end_time - start_time
        }

If the count is lower than the limit and you have a support contract, please file a support case so we can investigate further, as we may need more information from you.
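The core of the pagination loop above can be isolated into a small, testable helper. This is a minimal sketch, not Databricks code: `count_paginated` and `fake_fetch` are hypothetical names, and the fake fetcher only stands in for a real GET against the Repos endpoint.

```python
def count_paginated(fetch_page, response_key="repos"):
    """Count items across all pages of a next_page_token-style API.

    fetch_page(token) returns one page as a dict; each page carries the
    item list under response_key and, when more pages remain, a
    'next_page_token' entry. This mirrors the pagination loop above.
    """
    total = 0
    token = None
    while True:
        page = fetch_page(token)
        total += len(page.get(response_key, []))
        token = page.get("next_page_token")
        if not token:
            return total


# A fake fetcher standing in for a real GET on the Repos endpoint:
def fake_fetch(token):
    if token is None:
        return {"repos": [1, 2, 3], "next_page_token": "t1"}
    return {"repos": [4, 5]}

print(count_paginated(fake_fetch))  # 5
```

Injecting the fetcher keeps the counting logic separate from the HTTP client, so the loop can be verified without hitting the API.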

Kit
New Contributor III

Thanks Ian,

I only get the first page of the repos list. I can only recognise a few of them; the rest of the repos are under an internal path.

```
"repos": [
  {
    "id": {{ id }},
    "path": "/Repos/.internal/.alias/f/{{ some_values }}/{{ some_values }}",
    "url": "{{ url }}",
    "provider": "{{ provider }}",
    "head_commit_id": "{{ head_commit_id }}"
  },
  {
    "id": {{ id }},
    "path": "/Repos/{{ email }}/{{ repo_name }}",
    "url": "{{ url }}",
    "provider": "{{ provider }}",
    "branch": "{{ branch }}",
    "head_commit_id": "{{ head_commit_id }}"
  },
  {
    "id": {{ id }},
    "path": "/Repos/.internal/{{ some_values }}_commits/{{ head_commit_id }}",
    "url": "{{ url }}",
    "provider": "{{ provider }}",
    "head_commit_id": "{{ head_commit_id }}"
  },
```

I am using a git repo as the source for some scheduled jobs (which run every minute). Perhaps these internal repos are created by the scheduled jobs.

Unfortunately, I don't have a support contract yet.

Is there any way I can get help without one?
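To see how many of the listed repos are workflow-internal versus user-created, the paths returned by the Repos API can be tallied with a small helper. This is a minimal sketch: `tally_repo_paths` is a hypothetical name, and the `/Repos/.internal/` prefix is taken from the listing above.

```python
def tally_repo_paths(paths):
    """Split repo paths into workflow-internal and user-visible counts.

    Internal repos created for job checkouts appear under /Repos/.internal/,
    as seen in the API response above; everything else is a regular repo.
    """
    counts = {"internal": 0, "user": 0}
    for path in paths:
        if path.startswith("/Repos/.internal/"):
            counts["internal"] += 1
        else:
            counts["user"] += 1
    return counts


# Example with paths shaped like the redacted listing above:
sample = [
    "/Repos/.internal/.alias/f/abc/def",
    "/Repos/user@example.com/my-repo",
    "/Repos/.internal/abc_commits/0123456789abcdef",
]
print(tally_repo_paths(sample))  # {'internal': 2, 'user': 1}
```

Feeding every page of the listing through this (not just the first) would show whether the internal repos account for the 9k+ count.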

User16766737456
New Contributor III

Thanks, @Kit Yam Tse -- do you have the actual count (including the /Repos/.internal ones, which are, as you correctly note, for workflows)?

User16766737456
New Contributor III

Just to clarify: we count both internal repos (created by workflows, among others) and workspace repos toward the 5K limit. For workflows, when the repo count is exceeded, execution is initially blocked for 10 minutes until the count is reduced. There is also a cleanup process for finished tasks.

Does the job eventually get executed, or did it fail completely?

This is why it's important to get a complete repo count, so we can check whether this is the behaviour you are seeing.

Anonymous
Not applicable

It seems that I have a similar issue. Did you find a solution for this?

Hi, @Priscilla Maynard -- can you please send an email to help@databricks.com with more details? Thanks.


@Kit Yam Tse -- we are checking this internally and will keep you posted. Thanks for reporting this.
