11-25-2025 08:59 AM
As one of the steps in my data engineering pipeline, I need to perform a POST request to an HTTP (not HTTPS) server.
This all works fine, except for the situation described below: it then hangs indefinitely.
Environment:
Scenario:
headers = {"Content-Type": f"{mime_type}"}
chunk_size = 1024*1024
response = requests.post(
    destination_repo_url,
    headers=headers,
    auth=auth,
    timeout=10,
    data=(chunk for chunk in iter(lambda: source_file.read(chunk_size), b"")))
response.raise_for_status()
(Obviously the timeout is to be chosen, but whatever we choose, behavior is identical)
response = requests.post(
    destination_REST_URL_triggering_long_operation,
    auth=auth)
Expected behavior:
Observed behavior:
Alternatives tried:
11-26-2025 04:39 AM
You are seeing different behavior for a long-running requests.post() call in Azure Databricks (Python) than on your local machine. Locally, the timeout behaves as expected, but in Databricks the client "hangs indefinitely" even after the server has finished its post-processing and sent a response (204). Alternatives such as running curl as a subprocess in Databricks work as expected.
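The timeout parameter of requests.post() is applied as both the connect timeout and the read timeout. It is not a limit on the total duration of the request: the read timeout only bounds how long the client waits for the next bytes from the server. If the server doesn't send any bytes for longer than the timeout, a ReadTimeout should be raised.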
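To make the read-timeout semantics concrete, here is a minimal sketch using only the standard library (requests reaches the same socket-level timeout through urllib3). The `drip_server`, `read_all`, and `demo` helpers and all timing values are made up for illustration and are not part of the original pipeline. The point it tries to show: the timeout bounds each individual wait for data, so a slow but steady server never trips it, while a completely silent one does.

```python
import socket
import threading
import time


def drip_server(listener, pieces, pause):
    """Accept one client and send `pieces` single bytes, pausing before each one."""
    conn, _ = listener.accept()
    with conn:
        try:
            for _ in range(pieces):
                time.sleep(pause)
                conn.sendall(b".")
        except OSError:
            pass  # the client gave up and closed the connection


def read_all(port, timeout):
    """Read until EOF; each recv() may wait at most `timeout` seconds."""
    received = b""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as sock:
        while True:
            chunk = sock.recv(1024)  # raises socket.timeout after `timeout` s of silence
            if not chunk:
                return received
            received += chunk


def demo(pieces, pause, timeout):
    """Start a local drip server and read everything it sends."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(
        target=drip_server, args=(listener, pieces, pause), daemon=True
    )
    worker.start()
    try:
        return read_all(port, timeout)
    finally:
        worker.join(timeout=pieces * pause + 1)
        listener.close()
```

Calling `demo(pieces=4, pause=0.2, timeout=0.5)` takes about 0.8 seconds in total, longer than the timeout, yet completes because no single gap exceeds 0.5 seconds. Calling `demo(pieces=1, pause=1.0, timeout=0.3)` raises `socket.timeout`, which is the situation requests reports as a ReadTimeout.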
On the Azure Databricks Runtime, Python's networking stack might be subtly affected by the cluster's managed environment (network/NAT-level buffering, virtualized proxies, or custom firewall policies).
Your alternative test with curl works, which confirms the network route and server aren't blocking the traffic.
You see expected behavior running locally; only Databricks hangs indefinitely, even though the server completes successfully.
Databricks network virtualization: Databricks clusters often run in containers or on VMs with network proxies, which can interfere with low-level socket timeout detection by Python requests.
Requests library limitations: In some environments (especially with HTTP/1.1 keepalives), the Python socket layer's timeout detection can be bypassed if the underlying TCP connection is managed by an intermediary.
No data transfer during server post-processing: If the server sends no traffic (not even keepalive headers or HTTP chunked responses) during its post-processing, and intermediaries or the OS network stack buffer the connection, the requests library may not detect that the server is "silent" for longer than your timeout.
Differences in HTTP stack between requests and curl: curl might handle TCP-level inactivity better and might not be affected by an intermediate Databricks proxy in the way Python is.
curl via subprocess
Since curl works reliably in your environment, consider making the HTTP request via Python's subprocess module, capturing the output as needed.
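A minimal sketch of that workaround could look like the following. The function names, the basic-auth `user`/`password` parameters, and the timeout values are assumptions for illustration; the original post does not show what `auth` contains or which curl options were used. The curl flags themselves (`--max-time`, `--connect-timeout`, `--data-binary`, `--write-out`) are standard.

```python
import subprocess


def build_curl_post(url, file_path, mime_type, user, password, max_time=600):
    """Build a curl command that POSTs a file and prints only the HTTP status code."""
    return [
        "curl",
        "--silent", "--show-error",
        "--request", "POST",
        "--header", f"Content-Type: {mime_type}",
        "--user", f"{user}:{password}",      # assumption: HTTP basic auth
        "--data-binary", f"@{file_path}",    # send the file contents unmodified
        "--connect-timeout", "10",
        "--max-time", str(max_time),         # limit on the whole transfer, in seconds
        "--output", "/dev/null",             # discard the response body
        "--write-out", "%{http_code}",       # print the status code on stdout
        url,
    ]


def curl_post(url, file_path, mime_type, user, password, max_time=600):
    """Run curl and return the HTTP status code as an int."""
    cmd = build_curl_post(url, file_path, mime_type, user, password, max_time)
    result = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=max_time + 30,  # backstop in case curl itself stalls
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"curl exited with {result.returncode}: {result.stderr.strip()}"
        )
    return int(result.stdout.strip())
```

Passing the command as a list avoids shell quoting problems. Note that credentials given on the command line are visible in the process list, so a curl config file or a netrc file may be preferable on a shared cluster.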
stream=True
Try setting stream=True in your requests.post() call, pass separate connect and read timeouts, and read the response body yourself in chunks.
response = requests.post(..., stream=True, timeout=(connect_timeout, read_timeout))
for chunk in response.iter_content(chunk_size=8192, decode_unicode=False):
    # process chunk
    pass
But if the first byte from the server is delayed until post-processing is complete, this will not help.
Try using http.client (stdlib) for more customizable socket-level handling.
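As a rough sketch of what that could look like, assuming a plain-HTTP endpoint: the function name and parameters below are hypothetical, and enabling SO_KEEPALIVE is an extra suggestion that may or may not help in your environment, because the probe intervals are operating-system defaults unless tuned separately.

```python
import http.client
import socket


def post_with_socket_timeout(host, port, path, body, timeout, headers=None):
    """POST `body` over plain HTTP; each blocking socket call is bounded by `timeout`."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.connect()
        # Ask the OS to send TCP keepalive probes on this connection so that a
        # silently dropped peer can eventually be detected (OS-default intervals).
        conn.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        conn.request("POST", path, body=body, headers=headers or {})
        response = conn.getresponse()  # raises socket.timeout if no reply arrives in time
        return response.status, response.read()
    finally:
        conn.close()
```

Because the socket is directly accessible here, it is easy to log exactly which call blocks, which may help narrow down whether the hang occurs while sending the body or while waiting for the status line.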
If possible, test the same code on different runtime versions, or an ML cluster vs. a non-ML cluster.
Check if Azure NSG rules or Databricks cluster network configuration involve proxies or load balancers. These might buffer idle connections differently between Python and system-level curl.
Ask the server owner to occasionally send whitespace or HTTP/1.1 100-continue interim responses. You mentioned you can't control the server; if that's final, focus on client workarounds above.
The most probable cause is that Databricks' network path or virtualization introduces a condition where Python's requests and the underlying sockets are not notified of a closed socket, or where the network stack masks the silence. curl's handling at the OS level might bypass this issue, or it might use different buffering or keepalive logic.
| Approach | Databricks Python | Local Python | Databricks curl | Local curl |
|---|---|---|---|---|
| requests.post(timeout=10) | Hangs indefinitely | Behaves | N/A | N/A |
| subprocess.run(['curl']) | Works | Works | Works | Works |
For robust production pipelines in Azure Databricks, call curl (or a similar tool) via subprocess if server silence and networking quirks are causing issues for Python requests.
11-28-2025 03:43 AM
Thanks for your quick and extensive reply.
Given that I don't have any administration rights on the Azure/Databricks environment and don't have the REST-server under control, some of the sensible suggestions are difficult.
I will work with IT to check the Azure/Databricks settings.
In the meantime I will keep using the curl workaround.
11-30-2025 01:18 PM
Hello,
IMHO, having an HTTP-related task in a Spark cluster is an anti-pattern. This kind of code executes on the driver, runs synchronously, and adds overhead. This is one of the reasons DLT (or SDP, Spark Declarative Pipelines) does not have REST-based tasks.
Please review whether this task can be done outside Databricks, for example:
1) Event-based trigger: push the result from Databricks to cloud storage; this raises an event (Event Grid) for a listener such as a Function App or Logic App, which performs the HTTP task.
2) Classic poller: an Azure Function App checks for the expected condition every 'n' minutes and, if it is met, executes the HTTP task.