3 weeks ago
Hey! I'm new to the forums but not Databricks, trying to get some help with this question:
The error also is also fickle since it only appears what seems to be random. Like when running the same code it works then on the next run with a new set of data it errors out, then re-running it causes it to work sometimes. This issue also occurs when calling the FeatureEngineeringClient here are the code blocks where the issues occur.
output_df has on average 200k - 1m rows and 20 columns of all doubles except for the id which is a string. This table is not particularly large at all which is weird.
1. output_df.count()
2. fe.write_table(
DBR: 15.4 LTS ML
Driver - 1 c6id.4xl (Same as workers)
workers - min 1 max 16
StackTrace is in the attached pdf.
@ me if you need any additional information
2 weeks ago
3 weeks ago
The error message "Connection reset by peer" typically indicates that the connection between the client and the server was forcibly closed by the server. This can happen for various reasons, including network issues, server overload, or configuration problems.
It appears that the error occurred during the execution of a Spark job, specifically when the Python worker was trying to communicate with the JVM process.
Do you see any resource issues under cluster metrics?
3 weeks ago
3 weeks ago
Here are some potential causes and solutions:
Intermittent Errors: The error being fickle and appearing randomly suggests that it might be related to resource availability or transient issues in the cluster. Ensure that your cluster has sufficient resources allocated, especially since your dataset size ranges from 200k to 1 million rows.
Retries Exceeded: can be related to "RETRIES_EXCEEDED" when using the FeatureEngineeringClient
. This could be due to network issues, resource constraints, or other transient failures.
Python and PySpark Compatibility: Another snippet highlights issues with Python and PySpark version compatibility. Ensure that the versions of Python and PySpark you are using are compatible with each other and with the Databricks Runtime (DBR) version you are on (15.4 LTS ML).
Custom Docker Images: If you are using a custom Docker image, ensure that all dependencies, including PySpark and the databricks-feature-engineering
package, are correctly installed and compatible with each other.
Code Path Issues: There might be issues related to the code paths or the way modules are imported. Ensure that all necessary modules and paths are correctly set up in your environment.
Cluster Configuration: Verify that your cluster configuration (driver and worker nodes) is appropriate for the workload. Sometimes, increasing the number of worker nodes or their size can help resolve intermittent issues.
3 weeks ago
The versions are compatible 15.4 LTS ML uses py 3.11 and spark 3.5.0
I am not using a docker image
shouldn't 32G and 16cores be more than enough for 200k-1m rows? They dont contain anything more than doubles and short strings and with only 20 columns they shouldn't be "large"
"This could be due to network issues, resource constraints, or other transient failures." - if there are network issues, or transient failures how do I go about resolving them.
Finally, is using the FeatureEngineeringClient the best way to go about writing to tables? Could I use df.write.saveAsTable()?
3 weeks ago - last edited 3 weeks ago
@ls thanks for your question!
Since this is a PySpark application, the "Connection reset by peer" error seems to mask the actual exception. This type of issue is often linked to memory problems where Python workers are terminated, so the JVM <-> Python connection is reset. Here are some suggestions:
Memory Profiling: Try freezing the dataset that reproduces the problem and profile memory usage on the Python side. Look for issues like data distribution problems or skewness. Tools like memory-profiler might help.
Error Isolation: If possible, isolate the query and dataset causing the issue. Translating the Python code to Scala can help revealing the underlying exception.
Logs and Metrics: Check if there are any OOM-related messages in the executor logs. Additionally, analyze task metrics per executor in the Spark UI for any anomalies.
Cluster Resources: As a last resort, you could temporarily increase cluster resources to allow the job to complete. This approach can help you gather insights from the Spark UI metrics to understand why the current cluster size is insufficient for processing the dataset.
Additional Resources:
https://pypi.org/project/memory-profiler/
https://www.databricks.com/blog/2022/11/30/memory-profiling-pyspark.html
2 weeks ago
Solution was moving to DBR 16.1
2 weeks ago - last edited 2 weeks ago
@ls thanks for sharing!
if you don't mind and the details can be shared, what was the root cause? Is it a known issue which is only fixed on DBR 16.1or more of a product improvement?
2 weeks ago
I did nothing different than just switch the Runtime from 15.4 ML LTS to 16.1 ML and it write to the table just fine. I dont know exactly why it worked but I believe that the spark version might be different between the two and that could be the cause. The exact reason is still ambiguous to me.
2 weeks ago
However, this fix is not perfect. The same error occurs but significantly less often. I have also switched the cluster settings as such:
r5d.8xlarge - worker/driver
worker count 1-16 1 on-demand 15 spot
spark config:
spark.driver.maxResultSize 16g
2 weeks ago
@ls Agree, it doesn't seem to be fixed. Maybe on DBR 16 memory management is better optimized, hence I'd like to suggest going through the methods mentioned earlier in this post:
Memory Profiling: Try freezing the dataset that reproduces the problem and profile memory usage on the Python side. Look for issues like data distribution problems or skewness. Tools like memory-profiler might help.
Error Isolation: If possible, isolate the query and dataset causing the issue. Translating the Python code to Scala can help revealing the underlying exception.(This will not be straightforward). Instead, works needs to be done to capture the Python side thread dumps upon OOM error (traceback and/or faulthandler modules can become useful)
Logs and Metrics: Check if there are any OOM-related messages in the executor logs. Additionally, analyze task metrics per executor in the Spark UI for any anomalies.
Cluster Resources: As a last resort, you could temporarily increase cluster resources to allow the job to complete. This approach can help you gather insights from the Spark UI metrics to understand why the current cluster size is insufficient for processing the dataset.
Additional Resources:
https://pypi.org/project/memory-profiler/
https://www.databricks.com/blog/2022/11/30/memory-profiling-pyspark.html
If there are any challenges with any of the above steps, e.g.: with the memory profiling, you may raise a Support Ticket, so that we can better assist you in identifying the root cause and ultimately fixing the OOMs.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group