Monday
Hey! I'm new to the forums but not to Databricks, and I'm trying to get some help with this question:
The error is also fickle, since it appears at what seems to be random: running the same code works, then the next run with a new set of data errors out, and re-running it sometimes works again. The issue also occurs when calling the FeatureEngineeringClient; here are the code blocks where the issues occur.
output_df has on average 200k - 1m rows and 20 columns, all doubles except for the id, which is a string. The table is not particularly large at all, which is what makes this weird.
1. output_df.count()
2. fe.write_table(
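For reference, the write call is shaped roughly like this (the table name and arguments here are placeholders, not the exact code):

from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# Placeholder name -- the real call writes output_df to a Unity Catalog feature table
fe.write_table(
    name="catalog.schema.feature_table",
    df=output_df,
    mode="merge",
)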
DBR: 15.4 LTS ML
Driver - 1 c6id.4xl (same instance type as the workers)
Workers - min 1, max 16
The stack trace is in the attached PDF.
@ me if you need any additional information
Monday
The error message "Connection reset by peer" typically indicates that the connection between the client and the server was forcibly closed by the server. This can happen for various reasons, including network issues, server overload, or configuration problems.
It appears that the error occurred during the execution of a Spark job, specifically when the Python worker was trying to communicate with the JVM process.
Do you see any resource issues under cluster metrics?
Monday
Here are some potential causes and solutions:
Intermittent Errors: The error being fickle and appearing randomly suggests that it might be related to resource availability or transient issues in the cluster. Ensure that your cluster has sufficient resources allocated, especially since your dataset size ranges from 200k to 1 million rows.
Retries Exceeded: The error can be related to "RETRIES_EXCEEDED" when using the FeatureEngineeringClient. This could be due to network issues, resource constraints, or other transient failures.
Python and PySpark Compatibility: Ensure that the versions of Python and PySpark you are using are compatible with each other and with the Databricks Runtime (DBR) version you are on (15.4 LTS ML); a quick version check is sketched after this list.
Custom Docker Images: If you are using a custom Docker image, ensure that all dependencies, including PySpark and the databricks-feature-engineering package, are correctly installed and compatible with each other.
Code Path Issues: There might be issues related to the code paths or the way modules are imported. Ensure that all necessary modules and paths are correctly set up in your environment.
Cluster Configuration: Verify that your cluster configuration (driver and worker nodes) is appropriate for the workload. Sometimes, increasing the number of worker nodes or their size can help resolve intermittent issues.
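As a quick check for the compatibility point above, a notebook cell along these lines (a minimal sketch) shows the versions actually in use on the cluster:

import sys
import pyspark

# Versions in use on the driver -- compare against the DBR 15.4 LTS ML release notes
print("Python:", sys.version)
print("PySpark:", pyspark.__version__)
print("Spark:", spark.version)  # `spark` is the SparkSession Databricks provides in notebooks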
Monday
The versions are compatible: 15.4 LTS ML uses Python 3.11 and Spark 3.5.0.
I am not using a custom Docker image.
Shouldn't 32 GB and 16 cores be more than enough for 200k-1m rows? The rows don't contain anything more than doubles and short strings, and with only 20 columns they shouldn't be "large".
"This could be due to network issues, resource constraints, or other transient failures." - if there are network issues or transient failures, how do I go about resolving them?
Finally, is using the FeatureEngineeringClient the best way to go about writing to tables? Could I use df.write.saveAsTable()?
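i.e., something like this as a plain Spark write instead of going through the feature client (the table name is just a placeholder):

# Plain Spark/Delta write, no feature table metadata
output_df.write.mode("overwrite").saveAsTable("catalog.schema.my_table")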
Monday - last edited Monday
@ls thanks for your question!
Since this is a PySpark application, the "Connection reset by peer" error seems to mask the actual exception. This type of issue is often linked to memory problems where Python workers are terminated, so the JVM <-> Python connection is reset. Here are some suggestions:
Memory Profiling: Try freezing a dataset that reproduces the problem and profile memory usage on the Python side. Look for issues like data distribution problems or skew. Tools like memory-profiler might help; a minimal sketch is included after the links below.
Error Isolation: If possible, isolate the query and dataset causing the issue. Translating the Python code to Scala can help reveal the underlying exception.
Logs and Metrics: Check if there are any OOM-related messages in the executor logs. Additionally, analyze task metrics per executor in the Spark UI for any anomalies.
Cluster Resources: As a last resort, you could temporarily increase cluster resources to allow the job to complete. This approach can help you gather insights from the Spark UI metrics to understand why the current cluster size is insufficient for processing the dataset.
Additional Resources:
https://pypi.org/project/memory-profiler/
https://www.databricks.com/blog/2022/11/30/memory-profiling-pyspark.html
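For the memory-profiling suggestion above, a minimal sketch with memory-profiler could look like this (the function body is a placeholder for whatever actually produces output_df):

from memory_profiler import profile  # requires: %pip install memory-profiler

@profile  # prints line-by-line memory usage of this function in the driver's Python process
def build_output_df(spark):
    # placeholder transformation -- replace with the logic that builds output_df
    return spark.range(1_000_000)

result = build_output_df(spark)

Note that @profile only measures the Python process it runs in (the driver here); profiling the executor-side Python workers is what the PySpark memory profiling blog linked above covers.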