Hey @timo82,
This error indicates that the Python workers cannot communicate with the JVM after the maintenance update. Since it is affecting all jobs after upgrading to 15.4.25, try these steps:
--> Restart the cluster completely: stop it and then start it (not just a restart) to reinitialize the socket listeners
--> Check init scripts: temporarily remove any cluster init scripts and test whether jobs succeed without them, as maintenance updates can introduce incompatibilities
--> Review Spark configurations: check the driver logs for deprecated or conflicting Spark configs whose behavior may have changed between 15.4.24 and 15.4.25
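For the configuration review, it can help to compare the effective Spark settings from a cluster on 15.4.24 with one on 15.4.25. Here is a minimal sketch, assuming you have captured both as plain dicts; the helper name diff_configs and the sample values are made up for illustration.

```python
def diff_configs(before, after):
    """Return {key: (old, new)} for every key that was added, removed, or changed."""
    changed = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed


# Assumption: each dict was captured on the corresponding runtime, for example with
#   dict(spark.sparkContext.getConf().getAll())
# which may not be permitted in every cluster access mode.
# The keys and values below are illustrative only.
conf_old = {"spark.executor.memory": "8g", "spark.python.worker.reuse": "true"}
conf_new = {"spark.executor.memory": "8g", "spark.python.worker.reuse": "false"}

for key, (old, new) in diff_configs(conf_old, conf_new).items():
    print(f"{key}: {old} -> {new}")
```

Any key that shows up in the diff is a candidate to check against the driver logs.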
Code workarounds:
--> Add warmup operations: insert a simple operation like df.limit(1).collect() at the start of your jobs, before the main processing, to establish the connection
--> Implement retry logic: wrap the initial Spark actions in try/except blocks, as socket errors can be transient during startup
These code workarounds target the timing and initialization issues that can cause the socket error between the Python workers and the JVM.
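A rough sketch of the warmup and retry ideas combined. The names run_with_retry and warm_up are hypothetical helpers, not Databricks APIs, and the attempt count and delay are arbitrary starting points. The warmup uses spark.range(1).collect() as a stand-in for df.limit(1).collect(); swap in whichever small action actually touches the code path that fails for you.

```python
import time


def run_with_retry(action, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call action() until it succeeds or attempts run out, then re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:  # broad on purpose; narrow it to the error you actually see
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay_seconds}s")
            sleep(delay_seconds)


def warm_up(spark):
    """Run a trivial Spark action, with retries, before the main processing starts."""
    return run_with_retry(lambda: spark.range(1).collect())
```

At the top of a job you would call warm_up(spark) once, and optionally wrap the first real action the same way, for example run_with_retry(lambda: df.count()). If the error still occurs on every attempt, it is probably not a transient startup issue, and the steps below are the better path.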
If still failing:
--> Check cluster access mode: verify you're using the appropriate access mode (Shared or Single User) for your workload
--> Increase cluster resources: scale up memory if the errors are intermittent under load
--> Roll back to 15.4.24: if this is blocking production, temporarily revert while you investigate further
--> Contact Databricks support: since this affects all jobs after a maintenance update, there may be a regression in 15.4.25
harisankar