<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Issue with Pyspark GroupBy GroupedData in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/issue-with-pyspark-groupby-groupeddata/m-p/7257#M3176</link>
    <description>&lt;P&gt;Hi @Harun Raseed Basheer&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you for posting your question in our community! We are happy to assist you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This will also help other community members who may have similar questions in the future. Thank you for your participation, and let us know if you need any further assistance!&lt;/P&gt;</description>
    <pubDate>Mon, 27 Mar 2023 05:23:37 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2023-03-27T05:23:37Z</dc:date>
    <item>
      <title>Issue with Pyspark GroupBy GroupedData</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-with-pyspark-groupby-groupeddata/m-p/7255#M3174</link>
      <description>&lt;P&gt;Hi Guys,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am working on streaming data movement from bronze to silver. My bronze table has an entity_name column, and I need to create multiple silver tables based on its values.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried the approach below, but it fails with the error "'GroupedData' object has no attribute 'get_group'".&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sample code snippet:&lt;/P&gt;&lt;P&gt;grouped_df = bronze_df.groupBy("entity_name")&lt;/P&gt;&lt;P&gt;entity_names = [row.entity_name for row in grouped_df.agg({"entity_name": "first"}).collect()]&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for entity_name in entity_names:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;entity_df = grouped_df.get_group(entity_name)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I think a where/filter clause could do the job, but in my view it would not be an efficient solution. Is there any other approach to doing this?&lt;/P&gt;&lt;P&gt;TIA.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Mar 2023 14:09:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-with-pyspark-groupby-groupeddata/m-p/7255#M3174</guid>
      <dc:creator>Harun</dc:creator>
      <dc:date>2023-03-22T14:09:11Z</dc:date>
    </item>
    <item>
      <title>Re: Issue with Pyspark GroupBy GroupedData</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-with-pyspark-groupby-groupeddata/m-p/7256#M3175</link>
      <description>&lt;P&gt;@Harun Raseed Basheer:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The issue with your code is that groupBy returns a GroupedData object, which does not have a get_group method. get_group belongs to the pandas GroupBy API; PySpark's GroupedData is meant for aggregations such as agg and count, not for retrieving a single group. Instead, you can filter the bronze_df DataFrame for each entity name and write each resulting DataFrame to its own Silver table.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here's an example of how you can modify your code to achieve this. Note that collect() and write only work on a batch DataFrame; if bronze_df is a streaming DataFrame, this loop would need to run per micro-batch, for example inside a foreachBatch function.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import col

# Collect the distinct entity names to the driver
entity_names = [row.entity_name for row in bronze_df.select("entity_name").distinct().collect()]

# Filter the bronze DataFrame for each entity name and append to a separate Silver table
for entity_name in entity_names:
    entity_df = bronze_df.filter(col("entity_name") == entity_name)
    entity_df.write.format("delta").mode("append").save(f"/mnt/silver/{entity_name}")&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 23 Mar 2023 04:48:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-with-pyspark-groupby-groupeddata/m-p/7256#M3175</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-03-23T04:48:59Z</dc:date>
    </item>
    <item>
      <title>Re: Issue with Pyspark GroupBy GroupedData</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-with-pyspark-groupby-groupeddata/m-p/7257#M3176</link>
      <description>&lt;P&gt;Hi @Harun Raseed Basheer&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you for posting your question in our community! We are happy to assist you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This will also help other community members who may have similar questions in the future. Thank you for your participation, and let us know if you need any further assistance!&lt;/P&gt;</description>
      <pubDate>Mon, 27 Mar 2023 05:23:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-with-pyspark-groupby-groupeddata/m-p/7257#M3176</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-03-27T05:23:37Z</dc:date>
    </item>
  </channel>
</rss>

