2 weeks ago
Hi Databricks Community,
I am encountering an issue when trying to display a DataFrame in a Python notebook using serverless compute. The operation seems to fail after several retries, and I get the following error message:
[RETRIES_EXCEEDED] The maximum number of retries has been exceeded.
File /databricks/python/lib/python3.10/site-packages/pyspark/sql/connect/client/core.py:1435, in SparkConnectClient._analyze(self, method, **kwargs)
1434 with attempt:
-> 1435 resp = self._stub.AnalyzePlan(req, metadata=self.metadata())
1436 self._verify_response_integrity(resp)
File /databricks/python/lib/python3.10/site-packages/pyspark/sql/connect/client/retries.py:236, in Retrying._wait(self)
234 # Exceeded retries
235 logger.debug(f"Given up on retrying. error: {repr(exception)}")
--> 236 raise RetriesExceeded(error_class="RETRIES_EXCEEDED", message_parameters={}) from exception
Here are some additional details:
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, when, lit
from pyspark.sql.types import StringType, TimestampType
from tqdm import tqdm

# Add an empty 'name' column to be filled in per activity
df_10hz = df_10hz.withColumn('name', lit(None).cast(StringType()))

# Loop through each row in activity_periods and filter sensor_data
for row in tqdm(df_enrich_data.collect(), desc="Processing activity"):
    period_end = row['Timestamp']
    act_id = row['actId']

    # Debug print messages
    print(f"Processing actId: {act_id}")

    # Update Name column based on conditions
    df_10hz = df_10hz.withColumn(
        "name",
        when(col("actId") == act_id, lit(row['name'])).otherwise(col("name"))
    )

display(df_10hz)
Has anyone else encountered this issue? We would greatly appreciate any tips on how to resolve or debug it further!
Thank you in advance for your help!
2 weeks ago - last edited 2 weeks ago
Hi @boitumelodikoko,
I just created a dummy df, and the code did not throw any exception.
You can encounter the RETRIES_EXCEEDED error when trying to display a DataFrame in a Python notebook on Databricks serverless compute once the maximum number of retries for a given operation has been exceeded. This can happen for various reasons, such as network issues, resource limitations, or specific configurations.
If you can shed some light on the DataFrames you have and how you are obtaining them, that would help rule out any network issue.
If the job is not connecting to any third party, I would suggest reviewing the compute and checking whether you are able to submit commands and whether there are enough resources. In the logs there should be another entry indicating what is being retried; that will give you an idea of the cause of the failure.
This error in itself just indicates that the maximum retry limit has been reached.
2 weeks ago
Hi @NandiniN,
Thank you for your response and insights. I appreciate you taking the time to help me troubleshoot this issue.
To provide more context:
DataFrame Details:
Environment:
Error Context:
To work around this issue, I replaced the loop with a join operation to update the df_10hz DataFrame. This approach has significantly improved performance and avoided the retry error. While the join resolves the issue, I am curious to understand why the display() function fails with the larger dataset, even after retries, and if there are specific configurations or optimizations for serverless compute that could help.
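For reference, here is a minimal sketch of the join-based workaround (the column names actId and name are taken from the snippet above; my actual matching logic may differ slightly):

# Build a small lookup DataFrame mapping actId -> name
name_lookup = df_enrich_data.select("actId", "name").dropDuplicates(["actId"])

# A single left join replaces the per-row withColumn loop
df_10hz = (
    df_10hz
    .drop("name")  # drop the placeholder column added earlier
    .join(name_lookup, on="actId", how="left")
)

display(df_10hz)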
Based on your suggestion, I will:
Do you have any additional recommendations for optimizing large DataFrame operations in serverless compute or handling display() errors with large datasets?
Thank you again for your guidance!
Friday
For optimizing DataFrame operations: use cache() or persist() to cache intermediate DataFrames and avoid recomputation, and consider repartition() to control the number of partitions.
I believe you only want to display the data to sample it; in that case, use limit(1000) or show(1000) to restrict the number of rows displayed. You could also export large datasets to external storage (e.g., DBFS, S3) and download them for analysis.
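A minimal sketch of these suggestions (the DataFrame name df_10hz, the row count, and the output path are just placeholders):

# Cache the intermediate result so repeated actions do not recompute the full lineage
df_10hz = df_10hz.cache()

# Display only a sample instead of the full DataFrame
display(df_10hz.limit(1000))

# Alternatively, write the full result to storage and analyze it from there
df_10hz.write.mode("overwrite").format("delta").save("/tmp/df_10hz_output")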
Wednesday
@boitumelodikoko I am facing the exact same issue, but on all-purpose compute. It works well for smaller datasets, but for a large dataset it fails with the same error.
The dataset I am working on has 13M rows, and I have scaled up to n2-highmem-8 (same for worker and driver, autoscaling 4-8), but this hasn't helped either. I am thinking of trying one more size up to see how it goes.
@NandiniN Unfortunately, none of cache(), persist(), localCheckpoint(), or checkpoint() work; all of them error out with the same RETRIES_EXCEEDED error. I don't perform any joins per se, just some pivot operations provided by the discoverx library to scan all tables.
Do you have any other suggestions? Or is scaling up the cluster the only option?