Parallelizing processing of multiple spark dataframes

Dhruv_Sinha
New Contributor II

Hi all, I am trying to create a collection RDD that contains a list of Spark DataFrames, so that I can parallelize the cleaning process for each of them. Later on, I am sending each of these DataFrames to another method. However, when I parallelize, I get an error that the SparkContext cannot be accessed from worker nodes. I understand the error, but I wanted to learn if there is a way around it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def import_data(code):
    # assume that full_path is derived from `code` and that model_df
    # is imported successfully
    model_df = (
        spark
        .read
        .parquet(full_path)
    )
    return model_df


list_code = [59, 48]
input_list = []
for code in list_code:
    input_dict = {}
    model_df = import_data(code)
    input_dict[code] = model_df
    input_list.append(input_dict)

sc = spark.sparkContext
# This is where the error surfaces: each DataFrame holds a reference to the
# SparkContext, which cannot be serialized and shipped to the worker nodes
collection_rdd = sc.parallelize(input_list)

2 REPLIES

Kaniz
Community Manager

Hi @Dhruv_Sinha, the issue you're encountering, where the Spark context (sc) cannot be accessed from worker nodes, is a common challenge when working with Spark.

Let’s explore why this happens and discuss potential workarounds.

  1. Spark Context and Worker Nodes:

    • The Spark context (sc) is created on the driver node, which is the machine where your Spark application starts.
    • When you parallelize data using sc.parallelize(), Spark distributes the data across worker nodes (executors) in the cluster.
    • However, the Spark context (sc) is not automatically available on worker nodes because they run in separate JVMs (Java Virtual Machines).
  2. Why You Encounter the Error:

    • When you try to access sc within a transformation (such as map or foreach), it fails because the worker nodes don't have direct access to the driver's Spark context (a minimal failing sketch appears after this list).
    • This limitation is by design to prevent issues related to distributed computing and data consistency.
  3. Workarounds:

    • While direct access to sc from worker nodes is not possible, you can use alternative approaches:
      • Broadcast Variables: If you need to share read-only data (like configuration settings) with worker nodes, consider using broadcast variables. These are efficiently distributed to all nodes.
      • Accumulators: For aggregations or counters, use accumulators. They allow worker nodes to update shared variables in a distributed manner (a short accumulator sketch appears at the end of this reply).
      • Driver-side Processing: Perform any driver-side processing (such as collecting results) before distributing data to worker nodes.
      • Map-Only Operations: If possible, structure your operations as map-only transformations (without relying on the Spark context).
  4. Example Using Broadcast Variables:

    • Suppose you want to pass some configuration settings to worker nodes:
      # Create a broadcast variable
      config_settings = {"key1": "value1", "key2": "value2"}
      broadcast_settings = sc.broadcast(config_settings)
      
      def process_dataframe(df):
          # Access the broadcast variable (read-only) on the worker
          settings = broadcast_settings.value
          # Your processing logic here
          # ...
          return df
      
      # Apply process_dataframe to each dataframe in your collection_rdd
      result_rdd = collection_rdd.map(process_dataframe)
      
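To make points 1 and 2 concrete, here is a minimal sketch of the failing pattern; the paths are hypothetical and the offending call is left commented out so the snippet itself does not crash:

# Hypothetical paths, for illustration only
paths_rdd = sc.parallelize(["/data/xx_df", "/data/yy_df"])

def load_on_worker(path):
    # `spark` (and its SparkContext) live only on the driver, so this call
    # cannot run inside a transformation that executes on a worker
    return spark.read.parquet(path).count()

# counts = paths_rdd.map(load_on_worker).collect()
# -> fails, because the worker cannot reference the driver's SparkContext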

Remember that Spark is designed for distributed processing, and understanding its execution model is crucial for efficient and correct development. Feel free to adapt the suggested approaches based on your specific use case! 🚀
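As a brief postscript on the accumulator workaround in point 3, here is an equally minimal sketch; the data and the cleaning rule are purely illustrative:

# Driver-side: create an accumulator that executor-side code can add to
dropped = sc.accumulator(0)

def keep(value):
    if value is None:      # stand-in for a real cleaning rule
        dropped.add(1)     # executors may only add; the driver reads .value
        return False
    return True

cleaned = sc.parallelize([1, None, 2, None, 3]).filter(keep)
cleaned.count()            # an action must run before the accumulator is populated
print(dropped.value)       # -> 2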

Dhruv_Sinha
New Contributor II

Dear @Kaniz, thank you very much for your prompt response. This is a very detailed answer and I really appreciate all your help. Let me describe my problem more specifically. I have several datasets stored in parquet format, named 'xx_df', 'yy_df', etc. I want to read these datasets as Spark DataFrames and perform some cleaning on them. For example, I want to remove all the columns in each dataset that have more than 90% null values. Following that, I want to train a separate machine-learning model on each dataset.

So, I want to understand how I can parallelize the reading and processing of the parquet datasets into Spark DataFrames. I can share pseudo code with you if that would be helpful.
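
For reference, here is a minimal sketch of that workflow that respects the constraint discussed above: every spark.read and DataFrame call stays on the driver, and Python threads are used only to submit the per-dataset Spark jobs concurrently. The paths, the ThreadPoolExecutor-based overlap, and the train_model stub are assumptions for illustration, not something prescribed in this thread:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical parquet locations; substitute the real ones
paths = {"xx_df": "/mnt/data/xx_df", "yy_df": "/mnt/data/yy_df"}

def clean_dataset(name, path, null_threshold=0.9):
    # Runs on the driver; Spark distributes the actual work to the executors
    df = spark.read.parquet(path)
    total = df.count()
    # Null count per column, computed in a single pass over the data
    null_counts = df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    ).first().asDict()
    to_drop = [c for c, n in null_counts.items()
               if total > 0 and n / total > null_threshold]
    cleaned = df.drop(*to_drop)
    # train_model(cleaned)  # stub: per-dataset ML training would go here
    return name, cleaned

# Each thread submits its own Spark jobs; the cluster can run them concurrently
with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    results = dict(pool.map(lambda item: clean_dataset(*item), paths.items()))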
