Databricks Community

pjv · ‎05-10-2024

Hi all,

I have many API calls to make from a Python Databricks notebook, which I run regularly as a Databricks Workflow job. When I test the following code interactively on an all-purpose cluster (i.e. not via a job), it runs perfectly fine. However, when I run the same notebook as a job, it no longer behaves as expected: the calls run sequentially instead of in parallel. Does anyone know why, and what I can do to fix it?

Thank you!

Here is my code:

import asyncio
import requests
import nest_asyncio

nest_asyncio.apply()

async def with_threads():
    def make_request():
        response = requests.get('https://www.google.com')
        return response

    reqs = [asyncio.to_thread(make_request) for _ in range(0,20)]
    responses = await asyncio.gather(*reqs)
    return responses

async_result = asyncio.run(with_threads())

PS: The request and the loop are different in my original code; this simplified version is only used here to illustrate the problem.
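One thing that may be worth ruling out: `asyncio.to_thread` submits work to the event loop's default thread pool, and the size of that pool is derived from the CPU count of the machine running the code. The thread does not establish that this is the cause here. The following is only a minimal sketch, under that assumption, of running the blocking calls in an explicitly sized `ThreadPoolExecutor`, so that concurrency does not depend on the machine. `blocking_call`, `run_in_explicit_pool`, and `timed_run` are hypothetical names, and `time.sleep` stands in for the real `requests.get` call.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_call(i: int) -> int:
    # Hypothetical stand-in for a blocking call such as requests.get(...).
    time.sleep(0.2)
    return i


async def run_in_explicit_pool(n_calls: int = 20, max_workers: int = 20) -> list:
    # Use an explicitly sized pool instead of the loop's default executor,
    # so the degree of concurrency does not depend on the machine's CPU count.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            loop.run_in_executor(pool, blocking_call, i) for i in range(n_calls)
        ]
        # gather() returns results in the same order as the submitted futures.
        return await asyncio.gather(*futures)


def timed_run(n_calls: int = 20, max_workers: int = 20):
    # Returns the results together with the wall-clock time in seconds.
    start = time.perf_counter()
    results = asyncio.run(run_in_explicit_pool(n_calls, max_workers))
    return results, time.perf_counter() - start
```

Comparing the elapsed time from `timed_run` in the interactive notebook and in the job should show whether the calls really overlap. If the calls were running one after another, the elapsed time would be close to `n_calls` times the per-call delay. In a notebook where an event loop is already running, `asyncio.run` may need `nest_asyncio.apply()` first, as in the original snippet.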

mhiltner · ‎05-20-2024

Would you mind sharing the cluster setup for both cases? I'd make sure that the Databricks Runtime version is the same for both, and check the number of workers allocated to each cluster.
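To make that comparison concrete, a small diagnostic cell can be run in both the interactive notebook and the job, and the outputs compared. This is only a sketch: `describe_driver_environment` is a hypothetical helper, it assumes the `DATABRICKS_RUNTIME_VERSION` environment variable is set on the cluster, and the thread pool size is an estimate based on the default documented for `concurrent.futures.ThreadPoolExecutor` in Python 3.8 and later, not a value read from the running loop.

```python
import os
import platform


def describe_driver_environment() -> dict:
    # Collect a few facts about the machine this Python process runs on.
    cpu_count = os.cpu_count() or 1
    return {
        # Assumed to be set by Databricks; falls back to "not set" elsewhere.
        "databricks_runtime": os.environ.get("DATABRICKS_RUNTIME_VERSION", "not set"),
        "python_version": platform.python_version(),
        "cpu_count": cpu_count,
        # Estimate of the default ThreadPoolExecutor size (Python 3.8+ formula).
        "estimated_default_thread_pool_size": min(32, cpu_count + 4),
    }


if __name__ == "__main__":
    for key, value in describe_driver_environment().items():
        print(f"{key}: {value}")
```

If the runtime version, Python version, or CPU count differ between the two runs, that difference is a reasonable first place to look.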

pjv · ‎05-21-2024

I actually got it to work, though I do see that if I run two jobs with the same code in parallel, the async execution slows down. Does the number of workers on the cluster running the parallel jobs affect the execution time of the jobs' async calls?

Here is the code that I got to run:

import asyncio
import aiohttp

# Asynchronous function to fetch data from a given URL using aiohttp
async def fetch_data(session, url):
    async with session.get(url) as response:
        return await response.json()

# Asynchronous main function
async def get_url_data(input_args):
    # List of URLs to fetch data from
    # (get_api_url is a helper that is not shown in this post)
    urls = [get_api_url(input_arg) for input_arg in input_args]
    headers = {'X-API-KEY': "<API_KEY>"}

    # Create an aiohttp ClientSession for making asynchronous HTTP requests
    async with aiohttp.ClientSession(headers=headers) as session:
        # Create a list of tasks, where each task is a call to 'fetch_data' with a specific URL
        tasks = [fetch_data(session, url) for url in urls]

        # Use 'asyncio.gather()' to run the tasks concurrently and gather their results
        results = await asyncio.gather(*tasks, return_exceptions=False)

    # Return the results obtained from fetching data from each URL
    return results
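On the question of workers: as far as I understand, plain `asyncio` code in a notebook runs in the driver's Python process, so the number of workers would not be expected to change its speed, whereas two jobs sharing one cluster also share the driver and possibly the API's rate limits. The thread does not confirm this, so treat it as an assumption. The sketch below shows one way to cap the number of in-flight requests with `asyncio.Semaphore` and to time the result, so that runs can be compared. `fake_fetch`, `gather_bounded`, and `timed_gather` are hypothetical names, and `asyncio.sleep` stands in for the real `fetch_data(session, url)` call.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.1) -> dict:
    # Hypothetical stand-in for fetch_data(session, url); no HTTP is performed.
    await asyncio.sleep(delay)
    return {"url": url}


async def gather_bounded(urls, limit: int = 5) -> list:
    # Allow at most `limit` requests to be in flight at the same time.
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str):
        async with semaphore:
            return await fake_fetch(url)

    # return_exceptions=True keeps one failed call from discarding the rest;
    # failed calls then appear as exception objects in the result list.
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)


def timed_gather(urls, limit: int = 5):
    # Returns the results together with the wall-clock time in seconds.
    start = time.perf_counter()
    results = asyncio.run(gather_bounded(urls, limit))
    return results, time.perf_counter() - start
```

Logging the elapsed time from `timed_gather` for a single job and for two jobs running side by side would show how much the slowdown actually is, and lowering `limit` is one way to test whether the API itself is throttling the combined load.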


Asynchronous API calls from Databricks Workflow job
