Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Asynchronous API calls from Databricks Workflow job

pjv
New Contributor III

Hi all,

I have many API calls to run in a Python Databricks notebook, which I then run regularly as a Databricks Workflow job. When I test the following code interactively on an all-purpose cluster (i.e., not via a job), it runs perfectly fine. However, when I run the same notebook as a job, it no longer works: the calls run sequentially instead of in parallel. Does anyone know why, and what I can do to fix it?

Thank you!

Here is my code:
import asyncio
import requests

import nest_asyncio

# Allow asyncio.run() inside the notebook's already-running event loop
nest_asyncio.apply()

async def with_threads():
    def make_request():
        # Blocking HTTP GET, executed on a worker thread
        response = requests.get('https://www.google.com')
        return response

    # Schedule 20 requests on threads so they can run concurrently
    reqs = [asyncio.to_thread(make_request) for _ in range(20)]

    # Await all of them and collect the responses
    responses = await asyncio.gather(*reqs)

    return responses

async_result = asyncio.run(with_threads())
 
PS: The request and loop are different in my original code; this version is only used here to illustrate the problem.
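For what it's worth, the same fan-out can be done without an event loop at all, using a plain thread pool. This is a minimal sketch under the same assumptions as the snippet above (blocking requests.get calls, 20 identical requests):

import concurrent.futures
import requests

def make_request(url):
    # Blocking GET; the GIL is released while waiting on network I/O,
    # so threads still run these requests concurrently
    return requests.get(url)

urls = ['https://www.google.com'] * 20

# Fan the requests out over a pool of 20 worker threads
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    responses = list(pool.map(make_request, urls))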
2 REPLIES

mhiltner
Databricks Employee

Would you mind sharing the cluster setup for both cases? I'd make sure that the Databricks Runtime version is the same for both, and check the number of workers allocated in each cluster.
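For a quick comparison, something like this run in each environment would show the runtime and available cores (a minimal sketch; sc is the SparkContext handle Databricks notebooks provide, and DATABRICKS_RUNTIME_VERSION is the environment variable set on clusters):

import os

# Compare these values between the all-purpose cluster and the job cluster
print("Databricks Runtime:", os.environ.get("DATABRICKS_RUNTIME_VERSION"))
print("Default parallelism (total cores):", sc.defaultParallelism)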

pjv
New Contributor III

I actually got it to work, though I do see that if I run two jobs of the same code in parallel, the async execution time slows down. Does the number of workers on the cluster where the parallel jobs run affect the execution time of the jobs' async calls?

Here is the code that I got to run:
import asyncio
import aiohttp

# Asynchronous function to fetch data from a given URL using aiohttp
async def fetch_data(session, url):
    async with session.get(url) as response:
        return await response.json()

# Asynchronous main function
async def get_url_data(input_args):
    # List of URLs to fetch data from
    urls = [get_api_url(input_arg) for input_arg in input_args]

    headers = {'X-API-KEY': "<API_KEY>"}

    # Create an aiohttp ClientSession for making asynchronous HTTP requests
    async with aiohttp.ClientSession(headers=headers) as session:
        # Create a list of tasks, where each task is a call to 'fetch_data' with a specific URL
        tasks = [fetch_data(session, url) for url in urls]

        # Use 'asyncio.gather()' to run the tasks concurrently and gather their results
        results = await asyncio.gather(*tasks, return_exceptions=False)

    # Return the results obtained from fetching data from each URL
    return results
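A minimal driver for the above, with nest_asyncio applied as in the first post (input_args and get_api_url are placeholders from the snippet; get_api_url must be defined elsewhere to build the real request URLs):

import asyncio
import nest_asyncio

nest_asyncio.apply()

# Hypothetical inputs; get_api_url(input_arg) is assumed to return a full request URL
input_args = ["arg1", "arg2", "arg3"]

results = asyncio.run(get_url_data(input_args))
print(results)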
