Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

[RETRIES_EXCEEDED] Error When Displaying DataFrame in Databricks Using Serverless Compute

boitumelodikoko
Contributor

Hi Databricks Community,

I am encountering an issue when trying to display a DataFrame in a Python notebook using serverless compute. The operation seems to fail after several retries, and I get the following error message:

[RETRIES_EXCEEDED] The maximum number of retries has been exceeded.  
File /databricks/python/lib/python3.10/site-packages/pyspark/sql/connect/client/core.py:1435, in SparkConnectClient._analyze(self, method, **kwargs)  
   1434 with attempt:  
-> 1435     resp = self._stub.AnalyzePlan(req, metadata=self.metadata())  
   1436     self._verify_response_integrity(resp)  
File /databricks/python/lib/python3.10/site-packages/pyspark/sql/connect/client/retries.py:236, in Retrying._wait(self)  
    234 # Exceeded retries  
    235 logger.debug(f"Given up on retrying. error: {repr(exception)}")  
--> 236 raise RetriesExceeded(error_class="RETRIES_EXCEEDED", message_parameters={}) from exception  

Here are some additional details:

  • Environment: Python notebook using Databricks serverless compute.
  • Code example:
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, when, lit
from pyspark.sql.types import StringType, TimestampType
from tqdm import tqdm

# Add an empty 'name' column to the high-frequency sensor DataFrame
df_10hz = df_10hz.withColumn('name', lit(None).cast(StringType()))

# Loop through each row in the enrichment data and conditionally tag the sensor data
for row in tqdm(df_enrich_data.collect(), desc="Processing activity"):
    period_end = row['Timestamp']
    act_id = row['actId']

    # Debug print message
    print(f"Processing actId: {act_id}")

    # Update the 'name' column for rows matching this activity ID
    df_10hz = df_10hz.withColumn("name", when(
        (col("actId") == act_id), lit(row['name'])).otherwise(col("name")))

display(df_10hz)

Has anyone else encountered this issue? I would greatly appreciate any tips on how to resolve or debug it further!

Thank you in advance for your help!


Thanks,
Boitumelo
4 REPLIES

NandiniN
Databricks Employee

Hi @boitumelodikoko ,

I just created a dummy df, and the code did not throw any exception.

You can encounter the RETRIES_EXCEEDED error when trying to display a DataFrame in a Python notebook using Databricks serverless compute when the maximum number of retries for a certain operation has been exceeded. This can be due to various reasons such as network issues, resource limitations, or specific configurations.

If you can shed some light on the DataFrames you have and how you are obtaining them, that may reveal whether a network issue is involved.

But if it is not connecting to any third party, I would suggest reviewing the compute and checking whether you are able to submit commands and whether there are enough resources. In the logs, you should find another entry indicating what is being retried; that will give you an idea of what is causing the failure.

This error in itself just indicates that the maximum retry limit has been reached.

boitumelodikoko
Contributor

Hi @NandiniN,

Thank you for your response and insights. I appreciate you taking the time to help me troubleshoot this issue.

To provide more context:

  1. DataFrame Details:

    • df_10hz contains high-frequency sensor data, and I am attempting to update its name column based on activity periods from df_enrich_data.
    • df_enrich_data includes enrichment data such as timestamps and activity IDs.
  2. Environment:

    • I'm using Databricks serverless compute.
    • The dataset size is relatively large, which may contribute to resource constraints.
  3. Error Context:

    • The error specifically occurs when I try to display the df_10hz DataFrame using the display() function.
    • I initially used a loop to iterate through each row of df_enrich_data to update the df_10hz DataFrame conditionally. However, this approach led to the [RETRIES_EXCEEDED] error.
    • To troubleshoot, I tested the same logic with a smaller dataset, and it worked perfectly fine. This suggests that the issue might be related to data volume or resource limitations in the serverless compute environment.

To work around this issue, I replaced the loop with a join operation to update the df_10hz DataFrame. This approach has significantly improved performance and avoided the retry error. While the join resolves the issue, I am curious to understand why the display() function fails with the larger dataset, even after retries, and if there are specific configurations or optimizations for serverless compute that could help.
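
For reference, a minimal sketch of what that join-based approach can look like, assuming the same column names (actId, name, Timestamp) as in my original snippet:

from pyspark.sql.functions import col, coalesce

# Sketch only: bring the activity names over with a left join on actId instead of looping
df_names = df_enrich_data.select("actId", col("name").alias("enrich_name"))

df_10hz = (
    df_10hz
    .join(df_names, on="actId", how="left")
    .withColumn("name", coalesce(col("enrich_name"), col("name")))
    .drop("enrich_name")
)

display(df_10hz)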

Based on your suggestion, I will:

  • Review the logs to identify what is being retried and determine if there are potential network or resource bottlenecks.
  • Continue monitoring resource usage in the serverless environment to ensure it meets the workload demands.

Do you have any additional recommendations for optimizing large DataFrame operations in serverless compute or handling display() errors with large datasets?

Thank you again for your guidance!


Thanks,
Boitumelo

NandiniN
Databricks Employee

To optimize DataFrame operations:

  • Use cache() or persist() to cache intermediate DataFrames to avoid recomputation.
  • Use broadcast joins for small DataFrames and ensure join keys are properly partitioned.
  • Minimize shuffles by using repartition() (see the sketch after this list).
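
A rough sketch of these ideas, reusing the DataFrame and column names from the original post; treat it as illustrative rather than a drop-in fix:

from pyspark.sql.functions import broadcast

# Cache the large sensor DataFrame so repeated actions do not recompute it
df_10hz = df_10hz.cache()

# Broadcast the small enrichment DataFrame so the join avoids a large shuffle
df_joined = df_10hz.join(
    broadcast(df_enrich_data.select("actId", "name")),
    on="actId",
    how="left",
)

# Repartition by the join key if downstream operations group or join on it again
df_joined = df_joined.repartition("actId")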

I believe you would like to display the data only to sample it; in that case, use limit(1000) or show(1000) to restrict the number of rows displayed. You could also export large datasets to external storage (e.g., DBFS, S3) and download them for analysis.
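
A quick illustration, assuming the df_10hz DataFrame from this thread (the export path is just a placeholder):

# Render only a sample of rows in the notebook instead of the full DataFrame
display(df_10hz.limit(1000))

# Export the full result to storage (placeholder path) and analyze it offline
df_10hz.write.mode("overwrite").parquet("dbfs:/tmp/df_10hz_export")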

mohammedkhu
New Contributor II

@boitumelodikoko I am facing the exact same issue, but on all-purpose compute. It works well for smaller datasets, but for large datasets it fails with the same error.

The dataset I am working on has 13M rows, and I have scaled up to n2-highmem-8 (same for worker and driver, autoscaling 4-8), but this hasn't helped either. I am thinking of trying another size up to see how it goes.

@NandiniN Unfortunately, neither cache(), persist(), localCheckpoint(), nor checkpoint() works; all of them error out with the same RETRIES_EXCEEDED error. I don't perform any joins per se, just some pivot operations provided by the discoverx library to scan all tables.

Do you have any other suggestions, or is scaling up the cluster the only option?
