Hi all, I am trying to create an RDD (collection_rdd below) that contains a list of Spark DataFrames, so that I can parallelize the cleaning process for each of them. Later on, I send each of these DataFrames to another method. However, when I parallelize, I get an error saying the SparkContext cannot be accessed from worker nodes. I understand the error, but I wanted to learn whether there is a way around it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def import_data(code):
    # assume that full_path is available and model_df is imported successfully
    model_df = (spark
                .read
                .parquet(full_path))
    return model_df


list_code = [59, 48]
input_list = []
for code in list_code:
    # one-entry dict mapping the code to its DataFrame
    input_dict = {}
    model_df = import_data(code)
    input_dict[code] = model_df
    input_list.append(input_dict)

sc = spark.sparkContext
# this is where the error appears: the DataFrames hold a reference to the
# driver-side SparkContext and cannot be serialized out to worker nodes
collection_rdd = sc.parallelize(input_list)
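For context, this is roughly what I am trying to do afterwards. clean_frame is just a placeholder name for my cleaning method, and dropna stands in for the real cleaning logic; the point is that I want this map to run in parallel over the DataFrames:

def clean_frame(entry):
    # entry is a {code: DataFrame} dict produced above;
    # clean the DataFrame and return it keyed by the same code
    code, df = next(iter(entry.items()))
    return {code: df.dropna()}  # placeholder cleaning step

cleaned = collection_rdd.map(clean_frame).collect()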