Data Engineering

dataframe.rdd.isEmpty() is throwing error in 9.1 LTS

thushar
Contributor

I loaded a CSV file with five columns into a DataFrame and then added more than 15 columns using the DataFrame.withColumn method.

After adding these columns, running df.rdd.isEmpty() throws the error below.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 28) (10.139.64.4 executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

Any idea what the issue is?

Accepted Solution

Hubert-Dudek
Esteemed Contributor III

Please check your logs, as the cause may be a different issue.

Please also try bool(df.head(1)) instead.
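
As a rough illustration of the suggestion above, the check can be wrapped in a small helper. This is only a sketch: the name `dataframe_is_empty` is hypothetical and not part of PySpark. It relies on `DataFrame.head(n)` returning a list of at most `n` rows, so an empty list means the DataFrame has no rows. It is an assumption, not something confirmed in this thread, that avoiding the `df.rdd` conversion is what sidesteps the failure.

```python
def dataframe_is_empty(df) -> bool:
    """Return True when `df` has no rows.

    `df` is expected to be a pyspark.sql.DataFrame; only its head(n)
    method is used, so at most one row is brought back to the driver.
    """
    # head(1) returns a list holding zero or one Row; an empty list is falsy.
    return not bool(df.head(1))
```

Usage would look like `if dataframe_is_empty(df2): ...`. Newer PySpark releases (3.3 and later, as far as I know) also provide `DataFrame.isEmpty()`, but that method is not available on the Spark version shipped with 9.1 LTS.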


5 REPLIES

Anonymous
Not applicable

Hello again, @Thushar R - I'm sorry to hear that you're having this difficulty also. Let's give the community a chance to respond first. Thanks in advance for your patience.

Thanks for the workaround. But why does this particular piece of code fail on the 9.1 LTS runtime when it runs on 8.3 without issues? Any idea? Please see the code below.

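from pyspark.sql.functions import lit, col, row_number, floor, trim

# filePath is the path to the CSV file (defined earlier in the notebook)
df = spark.read.option("header", "true").csv(filePath)

# Keep the five source columns and drop duplicate rows
df2 = df.select(col("cc"), col("ac"), col("an"),
                col("ag"), col("at")).distinct()

lstOfMissingColumns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col8', 'col9', 'col9', 'col10', 'col11', 'col12', 'col13',
                       'col14', 'col15', 'col16', 'col17']

# Add each missing column as an empty-string literal, one withColumn call per name
for c in lstOfMissingColumns:
    df2 = df2.withColumn(c, lit(''))

# This is the call that fails on 9.1 LTS
df2.rdd.isEmpty()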
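
One variation that may be worth trying on the loop above: the PySpark documentation for `withColumn` notes that calling it many times in a loop can generate large query plans, and suggests a single `select` with all the new columns instead. The sketch below does that. The helper names `unique_in_order` and `add_empty_columns` are hypothetical, and whether the plan size has anything to do with the executor loss reported here is an assumption that has not been verified.

```python
def unique_in_order(names):
    """Drop repeated names while keeping first-seen order."""
    # dict preserves insertion order, so this de-duplicates stably.
    return list(dict.fromkeys(names))


def add_empty_columns(df, names):
    """Add every name in `names` as an empty-string column in one projection.

    Unlike withColumn, a name that already exists in `df` is skipped
    rather than overwritten.
    """
    # Imported inside the function so the helpers above load without pyspark.
    from pyspark.sql.functions import lit

    new_cols = [
        lit('').alias(name)
        for name in unique_in_order(names)
        if name not in df.columns
    ]
    # A single select adds all columns in one projection.
    return df.select('*', *new_cols)
```

With the code above this would be called as `df2 = add_empty_columns(df2, lstOfMissingColumns)`. The repeated 'col8' and 'col9' entries in the list collapse to one column each, which matches the end result of the original loop.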

Hi @Thushar R,

Are you using the same CSV file?

The error message is "Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages", which could indicate an out-of-memory (OOM) error. How big is your CSV file? Have you checked the logs for executor 9?

Anonymous
Not applicable

@Thushar R - Thank you for your patience. We are looking for the best person to help you.
