Parallelizing processing of multiple spark dataframes

Dhruv_Sinha — Wed, 21 Feb 2024 19:02:03 GMT

Hi all, I am trying to create an RDD, collection_rdd, that contains a list of Spark dataframes. I want to parallelize the cleaning process for each of these dataframes. Later on, I send each of these dataframes to another method. However, when I parallelize, I get an error saying that the Spark context cannot be accessed from worker nodes. I understand the error, but I would like to know whether there is a way around it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def import_data(code):
        # assume that full_path is available and model_df is imported successfully
        model_df = spark.read.parquet(full_path)
        return model_df

    list_code = [59, 48]
    input_list = []
    for code in list_code:
        input_dict = {}
        model_df = import_data(code)
        input_dict[code] = model_df
        input_list.append(input_dict)

    sc = spark.sparkContext
    collection_rdd = sc.parallelize(input_list)

Re: Parallelizing processing of multiple spark dataframes

Dhruv_Sinha — Fri, 23 Feb 2024 17:38:00 GMT

Dear @Retired_mod, thank you very much for your prompt response. This is a very detailed answer and I really appreciate all your help. Let me describe my problem more specifically. I have several datasets stored in parquet format. They are named 'xx_df', 'yy_df', etc. I want to read these datasets as Spark dataframes and perform some cleaning on them. For example, I want to remove every column in each dataset that has more than 90% null values. After that, I want to train a separate machine-learning model on each dataset.
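To make the cleaning step concrete, here is a minimal sketch of dropping columns whose null fraction exceeds a threshold. The function names (mostly_null_columns, drop_mostly_null_columns) and the default threshold argument are hypothetical and only for illustration. It assumes df is an ordinary Spark dataframe with unique column names, and it counts non-null values with a single selectExpr aggregation, so each dataframe is scanned once.

```python
def mostly_null_columns(non_null_counts, total_rows, threshold=0.9):
    """Return the columns whose fraction of nulls is strictly above threshold.

    non_null_counts maps column name -> number of non-null values.
    """
    if total_rows == 0:
        return []  # nothing to measure on an empty dataset
    return [
        name
        for name, non_null in non_null_counts.items()
        if (total_rows - non_null) / total_rows > threshold
    ]


def drop_mostly_null_columns(df, threshold=0.9):
    """Drop columns of a Spark dataframe that are more than `threshold` null.

    Assumption: df is a pyspark.sql.DataFrame. count(col) ignores nulls,
    so one aggregation yields the total row count and every non-null count.
    """
    columns = df.columns
    exprs = ["count(*) AS total_rows"] + [
        # Backticks quote the column name; embedded backticks are doubled.
        "count(`{}`) AS c_{}".format(name.replace("`", "``"), i)
        for i, name in enumerate(columns)
    ]
    row = df.selectExpr(*exprs).first()  # this action triggers one Spark job
    non_null_counts = {name: row[i + 1] for i, name in enumerate(columns)}
    to_drop = mostly_null_columns(non_null_counts, row[0], threshold)
    return df.drop(*to_drop) if to_drop else df
```

With the 90% rule from the post, the call would be something like cleaned_df = drop_mostly_null_columns(model_df, threshold=0.9).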

So, I want to understand how I can parallelize the reading and processing of parquet datasets into Spark dataframes. I can share pseudocode with you if that would be helpful.
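One common workaround for the "Spark context cannot be accessed from worker nodes" error is to keep every dataframe on the driver and submit the per-dataset work from driver-side threads, instead of putting dataframes inside an RDD. Spark can run jobs submitted from separate threads of the same application concurrently. The sketch below is only an illustration under assumptions: run_per_dataset, clean_all, paths, and output_dir are hypothetical names, and the dropna call is a placeholder for the real cleaning logic.

```python
from concurrent.futures import ThreadPoolExecutor


def run_per_dataset(process_one, keys, max_workers=4):
    """Call process_one(key) for every key from driver-side threads.

    Returns a dict mapping each key to the value process_one returned.
    """
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(process_one, keys))
    return dict(zip(keys, results))


def clean_all(spark, paths, output_dir):
    """Hypothetical driver code: paths maps a code (e.g. 59) to a parquet path.

    The SparkSession is only used here on the driver; it is never shipped
    to executors, which is what sc.parallelize on dataframes would require.
    """

    def clean_one(code):
        df = spark.read.parquet(paths[code])
        cleaned = df.dropna(how="all")  # placeholder for the real cleaning
        out = f"{output_dir}/{code}"
        # Transformations are lazy; the write is the action that runs the job.
        cleaned.write.mode("overwrite").parquet(out)
        return out

    return run_per_dataset(clean_one, list(paths))
```

Each thread has to trigger an action (a write, a count, a model fit) for its job to actually run in parallel with the others. How much real concurrency you get depends on the cluster's free executor capacity and scheduler settings.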
