Timeout for dbutils.jobs.taskValues.set(key, value...

novytskyi · ‎08-14-2024

I have a job that calls a notebook, which uses dbutils.jobs.taskValues.set(key, value) to assign around 20 parameters.

When I run it - it works.

But when I try to run two or more copies of the job with different parameters, it fails at different dbutils.jobs.taskValues.set(key, value) calls with this error:

An error occurred while calling o366.setJson. : org.apache.http.conn.ConnectTimeoutException: Connect to us-central1.gcp.databricks.com:443 [us-central1.gcp.databricks.com/xx.xx.xx.xx] failed: connect timed out at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72) at com.databricks.common.client.RawDBHttpClient.$anonfun$httpRequestInternal$1(DBHttpClient.scala:1203) at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:582) at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:685) at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:703) at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:435) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:216) at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:433) at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:427) at com.databricks.common.client.RawDBHttpClient.withAttributionContext(DBHttpClient.scala:603) at 
com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:481) at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:464) at com.databricks.common.client.RawDBHttpClient.withAttributionTags(DBHttpClient.scala:603) at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:680) at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:591) at com.databricks.common.client.RawDBHttpClient.recordOperationWithResultTags(DBHttpClient.scala:603) at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:582) at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:551) at com.databricks.common.client.RawDBHttpClient.recordOperation(DBHttpClient.scala:603) at com.databricks.common.client.RawDBHttpClient.httpRequestInternal(DBHttpClient.scala:1189) at com.databricks.common.client.RawDBHttpClient.entityEnclosingRequestInternal(DBHttpClient.scala:1178) at com.databricks.common.client.RawDBHttpClient.postInternal(DBHttpClient.scala:1062) at com.databricks.common.client.RawDBHttpClient.postJson(DBHttpClient.scala:757) at com.databricks.common.client.DBHttpClient.postJson(DBHttpClient.scala:574) at com.databricks.workflow.SimpleJobsSessionClient.setTaskValue(JobsSessionClient.scala:244) at com.databricks.workflow.ReliableJobsSessionClient.$anonfun$setTaskValue$1(JobsSessionClient.scala:438) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at com.databricks.common.client.DBHttpClient$.retryWithDeadline(DBHttpClient.scala:375) at com.databricks.workflow.ReliableJobsSessionClient.withRetry(JobsSessionClient.scala:401) at com.databricks.workflow.ReliableJobsSessionClient.setTaskValue(JobsSessionClient.scala:438) at com.databricks.workflow.WorkflowDriver.setTaskValue(WorkflowDriver.scala:52) at com.databricks.dbutils_v1.impl.TaskValuesUtilsImpl.setJson(TaskValuesUtilsImpl.scala:49) at 
sun.reflect.GeneratedMethodAccessor230.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397) at py4j.Gateway.invoke(Gateway.java:306) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199) at py4j.ClientServerConnection.run(ClientServerConnection.java:119) at java.lang.Thread.run(Thread.java:750) Caused by: java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:613) at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:368) at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ... 51 more

mark_ott · ‎11-17-2025

The error you see when several Databricks job runs call dbutils.jobs.taskValues.set(key, value) at the same time is a connection timeout to the Databricks backend API ("Connect to us-central1.gcp.databricks.com:443 ... failed: connect timed out"), not a problem with your code or parameters specifically.

What This Error Means

The ConnectTimeoutException occurs when a network connection to the Databricks workspace API cannot be established within the allotted time.
When you launch several copies of the job at once (especially with many parameters), each run independently talks to the Databricks API. If too many requests are made at once, they can exhaust available network resources or run into API rate or concurrency limits, which can lead to timeouts. Note that a connect timeout is raised before any HTTP response is received, so it points at connection establishment (the network path between the cluster and the workspace endpoint) at least as much as at an explicit rate-limit response.

Why Does It Work with One Job, But Not Many?

A single job run makes its roughly 20 taskValues.set calls on its own and does not stress your workspace's API or network resources.
Multiple runs in parallel, even if each sets only a few parameters, multiply the number of HTTP requests sent to Databricks at the same moment, which makes timeouts more likely.

How To Fix & Troubleshoot

1. Stagger Job Launches

Instead of starting all job runs simultaneously, try launching them in batches with a slight delay, allowing resources and connections to recover between launches.
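A minimal sketch of batched launching follows. The helper name launch_in_batches, the batch size, and the delay are illustrative assumptions, and the launch callable is a placeholder for whatever you already use to trigger a run (for example a Jobs API "run now" call); none of this comes from the original post.

```python
import time
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")
R = TypeVar("R")


def launch_in_batches(
    param_sets: Sequence[P],
    launch: Callable[[P], R],
    batch_size: int = 2,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Call launch(params) for every parameter set, pausing between batches.

    launch is a hypothetical placeholder for your own job-trigger function.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: List[R] = []
    for start in range(0, len(param_sets), batch_size):
        if start > 0:
            # Pause before every batch except the first one.
            sleep(delay_seconds)
        for params in param_sets[start:start + batch_size]:
            results.append(launch(params))
    return results
```

Passing sleep as a parameter keeps the helper easy to dry-run without real delays; tune batch_size and delay_seconds by observing when the timeouts stop.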

2. Reduce API Calls

Limit the number of calls to dbutils.jobs.taskValues.set: combine related values into a single JSON-serializable structure (e.g., a dictionary) and set it under one key. With around 20 parameters per run, this turns about 20 API calls into one.
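One way to bundle the parameters is sketched below. The helper name set_bundled_task_values and the key "job_params" are made up for illustration; the 48 KiB cap reflects my understanding of the documented size limit for a task value's JSON form and should be verified against the current Databricks docs. The setter is passed in as an argument so the function can be exercised outside a notebook.

```python
import json
from typing import Any, Callable, Dict

# Assumption: size cap for one task value's JSON representation; verify in the docs.
MAX_VALUE_BYTES = 48 * 1024


def set_bundled_task_values(
    set_value: Callable[..., None],
    values: Dict[str, Any],
    key: str = "job_params",  # hypothetical key name
    max_bytes: int = MAX_VALUE_BYTES,
) -> int:
    """Store all parameters under one key with a single set call.

    Returns the size in bytes of the JSON-encoded payload.
    """
    # json.dumps raises TypeError if a value is not JSON-serializable.
    payload = json.dumps(values)
    size = len(payload.encode("utf-8"))
    if size > max_bytes:
        raise ValueError(f"payload is {size} bytes, over the {max_bytes}-byte cap")
    set_value(key=key, value=values)
    return size


# In a Databricks notebook the setter would be the real dbutils function:
#   set_bundled_task_values(dbutils.jobs.taskValues.set, {"region": "eu", "batch": 7})
# A downstream task would then read the whole dictionary back with
# dbutils.jobs.taskValues.get, using that task's key and the "job_params" key.
```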

3. Resource and Quota Check

Check the resource quotas, API rate limits, and concurrent job run limits that apply to your Databricks workspace. These limits are enforced per workspace, so request an increase if your workload legitimately needs more.
Also make sure the cluster itself has enough network bandwidth.

4. Network Troubleshooting

Ensure no network bottlenecks exist between your cluster and the Databricks control plane. If the cluster runs in a restricted network, check firewall rules, egress/NAT configuration, and VPN latency, and compare against a setup with public access if you can.
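A quick connectivity probe run from the cluster can show whether plain TCP connects to the workspace host are slow or failing. This is only a sketch: the function name measure_tcp_connect is invented here, and the host in the usage comment is simply the one that appears in the stack trace above.

```python
import socket
import time
from typing import Optional


def measure_tcp_connect(host: str, port: int = 443, timeout: float = 5.0) -> Optional[float]:
    """Return the seconds taken to open a TCP connection, or None on failure.

    Failures include timeouts, refused connections, and DNS errors,
    all of which are subclasses of OSError.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


# Example, run from a notebook cell while several job runs are active:
#   for _ in range(10):
#       print(measure_tcp_connect("us-central1.gcp.databricks.com"))
```

If connects are fast with one run active and start returning None or multi-second values as more runs start, that points at the network path rather than at the notebook code.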

5. Increase Timeout

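If a connection/HTTP timeout setting is available to you, consider increasing it. As far as I know, dbutils.jobs.taskValues.set itself does not take a timeout argument, so this may not be possible from notebook code, and the Databricks default timeouts are intended to ensure stability.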

6. Retry Logic

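Implement robust retry logic for failed API calls. Some Databricks SDKs and APIs offer automatic retries for transient errors; the stack trace above passes through ReliableJobsSessionClient.withRetry and retryWithDeadline, which suggests the client already retried before the error surfaced. An outer retry with a longer backoff around the set call may still help.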
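An outer retry with exponential backoff could look like the sketch below. The names call_with_retry and looks_like_timeout are illustrative, and matching on the text "timed out" is a heuristic based on the error message quoted in the question, not an official error classification.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def looks_like_timeout(exc: BaseException) -> bool:
    """Heuristic: treat errors whose message mentions 'timed out' as transient."""
    return "timed out" in str(exc).lower()


def call_with_retry(
    action: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    is_retryable: Callable[[BaseException], bool] = looks_like_timeout,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run action(), retrying retryable failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not transient.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so parallel runs do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")


# In a notebook, wrap the set call in a zero-argument function:
#   call_with_retry(lambda: dbutils.jobs.taskValues.set(key="job_params", value=params))
```

The jitter matters here because the failing runs start together; without it, their retries would hit the API at the same instants again.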

7. Databricks Support/Docs

If the problem persists, collect all error logs and open a case with Databricks support, as this may indicate a workspace-specific networking or control plane issue that code changes cannot solve.

Summary Table

Potential Cause	Resolution Step
API concurrency/rate limits	Stagger jobs, batch parameters, check quotas
Network bottlenecks	Review cluster/network configuration
Workspace resource limits	Request workspace/cluster limits increase
Excessive API calls	Reduce/aggregate parameters per call
Transient/timeout error	Add retry logic, increase timeouts

This problem is common when scaling up Databricks job orchestration and typically relates to workspace or network limitations, not the correctness of the underlying application code.