Data Engineering

Parallelizing processing of multiple spark dataframes

Dhruv_Sinha
New Contributor II

Hi all, I am trying to create a collection RDD that contains a list of Spark DataFrames. I want to parallelize the cleaning process for each of these DataFrames, and later I send each of them to another method. However, when I call parallelize, I get an error that the SparkContext cannot be accessed from worker nodes. I understand why the error occurs, but I want to know whether there is a way around it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def import_data(code):
    # assume that full_path is derived from `code` and that the
    # parquet file is imported successfully
    model_df = (spark
        .read
        .parquet(full_path)
    )
    return model_df

list_code = [59, 48]
input_list = []
for code in list_code:
    input_dict = {}
    model_df = import_data(code)
    input_dict[code] = model_df
    input_list.append(input_dict)

sc = spark.sparkContext
# This is where the error is raised: the dictionaries hold DataFrames,
# which are tied to the driver's SparkContext and cannot be shipped to
# worker nodes.
collection_rdd = sc.parallelize(input_list)
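
Edit: a workaround I am experimenting with is to keep every DataFrame operation on the driver and overlap the work with ordinary Python threads instead of sc.parallelize. A minimal sketch (the path template and max_workers value here are placeholders, not my real setup):

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path template -- substitute the real parquet locations.
PATH_TEMPLATE = "/mnt/data/model_{code}.parquet"

def import_data(code):
    # Runs in a driver-side thread, so using `spark` here is safe.
    return code, spark.read.parquet(PATH_TEMPLATE.format(code=code))

list_code = [59, 48]

# The threads only overlap job scheduling; each read still executes as
# a normal distributed Spark job, so the SparkContext never has to be
# touched from a worker node.
with ThreadPoolExecutor(max_workers=len(list_code)) as pool:
    dataframes = dict(pool.map(import_data, list_code))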

1 REPLY

Dear @Retired_mod, thank you very much for your prompt response. This is a very detailed answer and I really appreciate your help. Let me describe my problem more specifically. I have several datasets stored in Parquet format, named 'xx_df', 'yy_df', etc. I want to read these datasets as Spark DataFrames and perform some cleaning on them. For example, I want to remove all columns in each dataset that have more than 90% null values. After that, I want to train a separate machine-learning model on each dataset.

So, I want to understand how I can parallelize the reading and processing of the Parquet datasets into Spark DataFrames. I can share pseudocode with you if that would be helpful.
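
As a starting point, here is roughly what I have in mind for the cleaning step (an untested sketch; the 0.9 threshold and the empty-dataset guard are placeholders):

from pyspark.sql import functions as F

def drop_mostly_null_columns(df, threshold=0.9):
    # Count nulls in every column with a single aggregation pass.
    total = df.count()
    if total == 0:
        return df
    null_counts = df.select(
        [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
    ).first().asDict()
    keep = [c for c in df.columns if null_counts[c] / total <= threshold]
    return df.select(keep)

Each cleaned DataFrame would then be handed to its own training call from the same driver-side loop (or thread pool), since the MLlib estimators also need the driver's SparkSession.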
