3 weeks ago
As one of the steps in my data engineering pipeline, I need to perform a POST request to an HTTP (not HTTPS) server.
This works fine, except in the situation described below, where the request hangs indefinitely.
Environment: Azure Databricks (Python); the same code behaves as expected when run locally.
Scenario:

```python
headers = {"Content-Type": f"{mime_type}"}
chunk_size = 1024 * 1024
response = requests.post(
    destination_repo_url,
    headers=headers,
    auth=auth,
    timeout=10,
    data=(chunk for chunk in iter(lambda: source_file.read(chunk_size), b"")))
response.raise_for_status()
```

(Obviously the timeout is to be chosen, but whatever we choose, the behavior is identical.)
```python
response = requests.post(
    destination_REST_URL_triggering_long_operation,
    auth=auth)
```

Expected behavior: the call returns once the server sends its response (or raises a ReadTimeout if a timeout is set and the server stays silent for longer than that).

Observed behavior: locally, this is exactly what happens; on Azure Databricks the call hangs indefinitely, even after the server has completed its post-processing and sent a 204 response.

Alternatives tried: invoking curl as a subprocess from the same Databricks notebook works as expected.
3 weeks ago
You are seeing different behavior for a long-running requests.post() call in Azure Databricks (Python) versus locally. Locally, the timeout behaves as expected, but in Databricks the client hangs indefinitely even after server post-processing has completed and a 204 response has been sent. However, alternatives such as running curl as a subprocess in Databricks work as expected.
The timeout parameter in requests.post() acts as both a connect and a read timeout. If the server sends no bytes for longer than the read timeout, a ReadTimeout should be raised.
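For reference, here is a minimal sketch of how those timeout semantics should behave, reusing the names from your question; the (connect, read) split and the exception handling are illustrative assumptions:

```python
import requests

try:
    response = requests.post(
        destination_REST_URL_triggering_long_operation,
        auth=auth,
        # (connect timeout, read timeout): the read timeout caps the silence
        # between bytes, not the total duration of the request
        timeout=(5, 10),
    )
    response.raise_for_status()
except requests.exceptions.ReadTimeout:
    # Should fire whenever the server sends nothing for more than 10 s
    print("Server sent no data within the read timeout")
```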
Under the Databricks runtime, Python's networking stack can be subtly affected by the cluster's managed environment (NAT-level buffering, virtualized proxies, or custom firewall policies).
Your alternative test with curl works, which confirms the network route and server aren’t blocking the traffic.
You see expected behavior running locally; only Databricks hangs indefinitely, even though the server completes successfully.
- **Databricks network virtualization:** Databricks clusters often run in containers or on VMs behind network proxies, which can interfere with low-level socket timeout detection by Python requests.
- **Requests library limitations:** In some environments (especially with HTTP/1.1 keep-alives), the Python socket layer's timeout detection can be bypassed if the underlying TCP connection is managed by an intermediary.
- **No data transfer during server post-processing:** If the server sends no traffic (not even keep-alive headers or HTTP chunked responses) during its post-processing, and intermediaries or the OS network stack buffer the connection, the requests library may not detect that the server has been "silent" for longer than your timeout.
- **Differences in HTTP stack between requests and curl:** curl may handle TCP-level inactivity differently and not be affected by an intermediate Databricks proxy the way Python is.
**curl via subprocess**
Since curl works reliably in your environment, consider making the HTTP request via Python's subprocess module, capturing the output as needed.
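A minimal sketch of that workaround, assuming a file upload like the one in your question; the credentials and payload path are illustrative placeholders:

```python
import subprocess

result = subprocess.run(
    [
        "curl",
        "--silent", "--show-error",
        "--fail",                          # non-2xx responses become a non-zero exit code
        "--max-time", "600",               # hard upper bound on the whole transfer
        "--user", f"{user}:{password}",    # hypothetical credentials
        "--header", f"Content-Type: {mime_type}",
        "--data-binary", "@/path/to/source_file",  # placeholder payload path
        destination_repo_url,
    ],
    capture_output=True,
    text=True,
    check=True,  # raises CalledProcessError if curl exits non-zero
)
print(result.stdout)
```

Because --max-time bounds the entire transfer at the curl level, this call cannot hang indefinitely regardless of what the intermediaries do.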
**stream=True**
Try setting stream=True in your requests.post() call, then read the response manually with a controlled timeout using lower-level socket timeouts.
```python
response = requests.post(..., stream=True, timeout=(connect_timeout, read_timeout))
for chunk in response.iter_content(chunk_size=8192, decode_unicode=False):
    ...  # process each chunk as it arrives
```
But if the first byte from the server is delayed until post-processing is complete, this will not help.
Try using http.client (stdlib) for more customizable socket-level handling.
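A minimal sketch with http.client, assuming a plain-HTTP endpoint; the host, path, and authorization header are placeholders:

```python
import http.client

# The timeout here is applied directly to the underlying socket
conn = http.client.HTTPConnection("destination.example.com", 80, timeout=10)
try:
    conn.request("POST", "/trigger-long-operation",
                 headers={"Authorization": "Basic ..."})  # placeholder credentials
    # getresponse() raises TimeoutError (socket.timeout) if the server
    # stays silent for longer than the socket timeout set above
    response = conn.getresponse()
    print(response.status, response.reason)
    body = response.read()
finally:
    conn.close()
```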
If possible, test the same code on different runtime versions, or an ML cluster vs. a non-ML cluster.
Check if Azure NSG rules or Databricks cluster network configuration involve proxies or load balancers. These might buffer idle connections differently between Python and system-level curl.
Ask the server owner to occasionally send whitespace or HTTP/1.1 100-continue interim responses. You mentioned you can't control the server; if that's final, focus on client workarounds above.
The most probable cause is that Databricks' network path or virtualization introduces a condition where Python's requests and its underlying sockets are not notified of a closed socket, or the network stack masks the silence. curl's handling at the OS level may bypass this issue or use different buffering and keepalive logic.
| Approach | Databricks Python | Local Python | Databricks curl | Local curl |
|---|---|---|---|---|
| requests.post(timeout=10) | Hangs indefinitely | Behaves as expected | N/A | N/A |
| subprocess.run(['curl']) | Works | Works | Works | Works |
For robust production pipelines in Azure Databricks, use curl (or a similar tool) via subprocess if server silence and networking quirks cause issues for Python requests.
3 weeks ago
Thanks for your quick and extensive reply.
Given that I don't have any administration rights on the Azure/Databricks environment and don't have the REST server under my control, some of the sensible suggestions are difficult to implement.
I will work with IT to check the Azure/Databricks settings.
In the meantime I will keep using the curl workaround.
2 weeks ago
Hello,
IMHO, having an HTTP-related task in a Spark cluster is an anti-pattern. This kind of code executes on the driver, runs synchronously, and adds overhead. This is one of the reasons DLT (or SDP, Spark Declarative Pipelines) does not have REST-based tasks.
Please review whether this task can be done outside Databricks, for example:
1) Event-based trigger: push the result from Databricks to cloud storage; this creates an event (Event Grid) for a listener such as an Azure Function or Logic App, which then performs the HTTP task.
2) Classic poller: an Azure Function App checks for the expected condition every 'n' minutes and, if it is met, executes the HTTP task (see the sketch below).
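For option 2, a minimal sketch of an Azure Functions timer trigger (Python v2 programming model); the schedule, URL, and condition check are assumptions:

```python
import azure.functions as func
import requests

app = func.FunctionApp()

# Runs every 5 minutes (NCRONTAB: second minute hour day month day-of-week)
@app.timer_trigger(schedule="0 */5 * * * *", arg_name="timer")
def poll_and_post(timer: func.TimerRequest) -> None:
    if not expectation_met():  # hypothetical condition check
        return
    response = requests.post(
        "http://destination.example.com/trigger-long-operation",  # placeholder URL
        timeout=10,
    )
    response.raise_for_status()

def expectation_met() -> bool:
    # Placeholder: e.g. check for a marker file in cloud storage
    return True
```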