<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Parallelizing processing of multiple spark dataframes in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/parallelizing-processing-of-multiple-spark-dataframes/m-p/61392#M31781</link>
    <description>&lt;P&gt;Hi all, I am trying to create an RDD (collection_rdd) that contains a list of Spark DataFrames so that I can parallelize the cleaning of each DataFrame. Later on, I send each of these DataFrames to another method. However, when I parallelize, &lt;STRONG&gt;I get an error saying that the Spark context cannot be accessed from worker nodes&lt;/STRONG&gt;. I understand the error, but I would like to know whether there is a way around it.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def import_data(code):
    # Assume that full_path is available and that model_df is read successfully.
    model_df = spark.read.parquet(full_path)
    return model_df


list_code = [59, 48]
input_list = []
for code in list_code:
    # Each list element is a one-entry dict mapping the code to its DataFrame.
    input_list.append({code: import_data(code)})

sc = spark.sparkContext
# This is the call that fails: a DataFrame is a driver-side handle tied to the
# SparkContext, so it cannot be serialized into an RDD and shipped to workers.
collection_rdd = sc.parallelize(input_list)&lt;/LI-CODE&gt;</description>
    <pubDate>Wed, 21 Feb 2024 19:02:03 GMT</pubDate>
    <dc:creator>Dhruv_Sinha</dc:creator>
    <dc:date>2024-02-21T19:02:03Z</dc:date>
    <item>
      <title>Parallelizing processing of multiple spark dataframes</title>
      <link>https://community.databricks.com/t5/data-engineering/parallelizing-processing-of-multiple-spark-dataframes/m-p/61392#M31781</link>
      <description>&lt;P&gt;Hi all, I am trying to create an RDD (collection_rdd) that contains a list of Spark DataFrames so that I can parallelize the cleaning of each DataFrame. Later on, I send each of these DataFrames to another method. However, when I parallelize, &lt;STRONG&gt;I get an error saying that the Spark context cannot be accessed from worker nodes&lt;/STRONG&gt;. I understand the error, but I would like to know whether there is a way around it.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def import_data(code):
    # Assume that full_path is available and that model_df is read successfully.
    model_df = spark.read.parquet(full_path)
    return model_df


list_code = [59, 48]
input_list = []
for code in list_code:
    # Each list element is a one-entry dict mapping the code to its DataFrame.
    input_list.append({code: import_data(code)})

sc = spark.sparkContext
# This is the call that fails: a DataFrame is a driver-side handle tied to the
# SparkContext, so it cannot be serialized into an RDD and shipped to workers.
collection_rdd = sc.parallelize(input_list)&lt;/LI-CODE&gt;</description>
      <pubDate>Wed, 21 Feb 2024 19:02:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallelizing-processing-of-multiple-spark-dataframes/m-p/61392#M31781</guid>
      <dc:creator>Dhruv_Sinha</dc:creator>
      <dc:date>2024-02-21T19:02:03Z</dc:date>
    </item>
    <item>
      <title>Re: Parallelizing processing of multiple spark dataframes</title>
      <link>https://community.databricks.com/t5/data-engineering/parallelizing-processing-of-multiple-spark-dataframes/m-p/61753#M31838</link>
      <description>&lt;P&gt;Dear &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;, thank you very much for your prompt response. This is a very detailed answer, and I really appreciate all your help. Let me describe my problem more specifically. I have several datasets stored in Parquet format. They are named 'xx_df', 'yy_df', etc. I want to read these datasets as Spark DataFrames and perform some cleaning on them. For example, I want to remove every column in each dataset that has more than 90% null values. After that, I want to train a separate machine-learning model on each dataset.&lt;/P&gt;&lt;P&gt;So, I want to understand how I can parallelize the reading and processing of the Parquet datasets into Spark DataFrames. I can share pseudocode if that would be helpful.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Feb 2024 17:38:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallelizing-processing-of-multiple-spark-dataframes/m-p/61753#M31838</guid>
      <dc:creator>Dhruv_Sinha</dc:creator>
      <dc:date>2024-02-23T17:38:00Z</dc:date>
    </item>
  </channel>
</rss>

