Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Long-running Python http POST hangs

Johan_Van_Noten
New Contributor III

As one of the steps in my data engineering pipeline, I need to perform a POST request to an HTTP (not HTTPS) server.
This works fine, except in the situation described below, where it hangs indefinitely.

Environment:

  • Azure Databricks Runtime 13.3 LTS
  • Python 3.10.12
  • Executing from a notebook

Scenario:

  • Example upload of a (big) file:
import requests

headers = {"Content-Type": f"{mime_type}"}
chunk_size = 1024 * 1024  # 1 MiB
response = requests.post(
    destination_repo_url,
    headers=headers,
    auth=auth,
    timeout=10,
    data=(chunk for chunk in iter(lambda: source_file.read(chunk_size), b"")))
response.raise_for_status()

(Obviously the timeout value is a choice, but whatever we choose, the behavior is identical.)

  • Example without the file-upload complexity:
response = requests.post(
    destination_REST_URL_triggering_long_operation,
    auth=auth)

Expected behavior:

  • Operation takes the required time (e.g. 5 minutes), then completes.
    The real transfer time is limited, the server's processing is long, but finally should complete without issues.
  • One would expect no difference in behaviour between a local Python process and one running on the Azure Databricks cluster, or at least a clear explanation of why it behaves differently and how to avoid it.

Observed behavior:

  • Operation never ends on client side (server side completes as usual).
  • The provided timeout makes no difference, although you would expect the read timeout to trigger, since nothing is received from the server during its long post-processing.

Alternatives tried:

  • curl as a subprocess works as expected.
    • You see the upload taking place (in case of upload).
    • Then logs multiple lines with no transfer.
    • Then completes correctly after the server sends back its 204.
  • Trying from a local system (so not Databricks), Wireshark shows:
    • POST operation is invoked
    • Data is sent (in case of the file transfer) and completes
    • Server is postprocessing
    • If timeout < processing time
      • Client gives up on ReadTimeout.
        This is expected, normal behavior. To avoid it, set the read timeout large enough.
      • Server continues its work and completes it correctly (but doesn't report it anymore because of connection closure by client).
    • Else
      • Server replies the usual 204 once completed
      • Client completes the blocking post request normally.
  • I don't have the server's implementation under control, so I can't change that.
2 REPLIES

mark_ott
Databricks Employee

You are experiencing different behaviors running a long-running requests.post() operation in Azure Databricks (Python) versus running it locally. Locally, the timeout behaves as expected, but in Databricks the client “hangs indefinitely” even after server post-processing has completed and a response (204) is sent. However, alternatives like curl as a subprocess in Databricks work as expected.

Key Observations

  • The timeout parameter in requests.post() behaves as a connect and read timeout. If the server doesn’t send any bytes for longer than timeout, a ReadTimeout should trigger.

  • With Azure Databricks Runtime (ADR), Python’s networking stack might be subtly affected by the cluster’s managed environment (network/NAT-level buffering, virtualized proxies, or custom firewall policies).

  • Your alternative test with curl works, which confirms the network route and server aren’t blocking the traffic.

  • You see expected behavior running locally; only Databricks hangs indefinitely, even though the server completes successfully.
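
The read-timeout semantics in the first observation can be demonstrated locally. This is a minimal sketch, not the real setup: the silent local server below is a stand-in for an endpoint that sends nothing back during post-processing, showing that requests' read timeout caps the silence between bytes, not the total duration.

```python
import socket
import threading
import time

import requests

# A local server that accepts the POST but then stays silent, to show that
# the second element of timeout=(connect, read) bounds server silence.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def silent_server():
    conn, _ = server.accept()
    conn.recv(65536)   # consume the request, then send nothing back
    time.sleep(5)      # stay silent longer than the client's read timeout
    conn.close()

threading.Thread(target=silent_server, daemon=True).start()

try:
    requests.post(f"http://127.0.0.1:{port}/", timeout=(2, 1))
    outcome = "completed"
except requests.exceptions.ReadTimeout:
    outcome = "read timeout"   # expected here: server silent for > 1 s
```

This is exactly the behavior seen locally; the puzzle is why the same ReadTimeout never fires on the Databricks cluster.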

Potential Causes

  • Databricks network virtualization: Databricks clusters often run in containers or on VMs with network proxies, which can interfere with low-level socket timeout detection by Python requests.

  • Requests library limitations: In some environments (especially with HTTP/1.1 keepalives), the Python socket layer’s timeout detection can be bypassed if the underlying TCP connection is managed by an intermediary.

  • No data transfer during server post-processing: If the server sends no traffic (not even keepalive headers or HTTP chunked responses) during its post-processing, and intermediaries or the OS network stack buffer the connection, the requests library may not detect that the server is “silent” for longer than your timeout.

  • Differences in HTTP stack between requests and curl: curl might handle TCP-level inactivity better and not be affected by an intermediate Databricks proxy the way Python is.

How to Work Around It

1. Use curl via Subprocess

Since curl works reliably in your environment, consider making the HTTP request via Python's subprocess module, capturing the output as needed.
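
A minimal sketch of that workaround, assuming curl is on the PATH; the function name, URL, credentials, and file path are all illustrative, not prescribed:

```python
import subprocess

def post_with_curl(url, user, password, file_path, timeout_s=600):
    """Sketch: POST a file via a curl subprocess (all arguments illustrative)."""
    result = subprocess.run(
        ["curl", "--silent", "--show-error",
         "--fail",                       # non-zero exit code on HTTP errors >= 400
         "--request", "POST",
         "--max-time", str(timeout_s),   # overall cap on the whole transfer
         "--user", f"{user}:{password}",
         "--header", "Content-Type: application/octet-stream",
         "--data-binary", f"@{file_path}",
         url],
        capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"curl failed ({result.returncode}): {result.stderr}")
    return result.stdout
```

Note that curl's `--max-time` is an overall deadline for the whole request, which is closer to what you may actually want here than requests' per-read timeout.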

2. Explicitly Set stream=True

Try setting stream=True in your requests.post(). Then, read the response manually with a controlled timeout using lower-level socket timeouts.

response = requests.post(..., stream=True, timeout=(connect_timeout, read_timeout))
for chunk in response.iter_content(chunk_size=8192, decode_unicode=False):
    ...  # process each chunk as it arrives

But if the first byte from the server is delayed until post-processing is complete, this will not help.

3. Use Lower-Level HTTP Client

Try using http.client (stdlib) for more customizable socket-level handling.
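
A self-contained sketch of that approach. The tiny local server below merely stands in for the real endpoint (it mimics the 204 the real server sends on completion); with http.client, the timeout is a plain socket timeout on the underlying connection:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the real endpoint: consume the body, reply 204 (assumption
# based on the thread; the real server's behavior may differ).
class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# timeout here is applied directly to the socket
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=10)
conn.request("POST", "/long-operation", body=b"payload")
status = conn.getresponse().status   # blocks until headers arrive or timeout
conn.close()
server.shutdown()
```

Because http.client sets the timeout straight on the socket, it removes one layer (urllib3's connection pooling) from the diagnosis, even if it ultimately hits the same network path.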

4. Test with Different Databricks Runtimes

If possible, test the same code on different runtime versions, or an ML cluster vs. a non-ML cluster.

5. Confirm Network Middleboxes

Check if Azure NSG rules or Databricks cluster network configuration involve proxies or load balancers. These might buffer idle connections differently between Python and system-level curl.

6. Change Server Behavior (if possible)

Ask the server owner to occasionally send whitespace or HTTP/1.1 100-continue interim responses. You mentioned you can't control the server; if that's final, focus on client workarounds above.

Why the Difference?

The most probable cause is that Databricks' network path or virtualization introduces a condition where Python's requests and the underlying sockets are not notified of a closed socket, or the network stack masks the silence. Curl's OS-level handling might bypass this issue, or it may use different buffering or keepalive logic.

Summary Table

Approach                     Databricks Python    Local Python    Databricks curl    Local curl
requests.post(timeout=10)    Hangs indefinitely   Behaves         N/A                N/A
subprocess.run(['curl'])     Works                Works           Works              Works

Recommendation

For robust production pipelines in Azure Databricks, use curl (or a similar tool) via subprocess if server silence and networking quirks cause issues for Python requests.

Johan_Van_Noten
New Contributor III

Thanks for your quick and extensive reply.
Given that I don't have administration rights on the Azure/Databricks environment and don't control the REST server, some of the sensible suggestions are difficult to apply.
I will work with IT to check the Azure/Databricks settings.
In the meantime I will keep using the curl workaround.