Hi all, I am trying to create an RDD (collection_rdd below) that contains a list of Spark DataFrames, so that I can parallelize the cleaning process for each of them. Later on, I send each of these DataFrames to another method. However, when I parallelize, I get an error saying the SparkContext cannot be accessed from worker nodes. I understand the error, but I wanted to learn whether there is a way around it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def import_data(code):
    # assume that full_path is available and model_df is imported successfully
    model_df = (spark
                .read
                .parquet(full_path))
    return model_df


list_code = [59, 48]
input_list = []
for code in list_code:
    # one-entry dict mapping the code to its DataFrame
    input_dict = {}
    model_df = import_data(code)
    input_dict[code] = model_df
    input_list.append(input_dict)

sc = spark.sparkContext
# this is where the error appears: the DataFrames hold a reference to the
# driver-side SparkContext and cannot be serialized out to worker nodes
collection_rdd = sc.parallelize(input_list)
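For context, this is roughly what I am trying to do afterwards. clean_frame is just a placeholder name for my cleaning method, and dropna stands in for the real cleaning logic; the point is that I want this map to run in parallel over the DataFrames:

def clean_frame(entry):
    # entry is a {code: DataFrame} dict produced above;
    # clean the DataFrame and return it keyed by the same code
    code, df = next(iter(entry.items()))
    return {code: df.dropna()}  # placeholder cleaning step

cleaned = collection_rdd.map(clean_frame).collect()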