<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to implement the where not exists pattern in scala? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13821#M8422</link>
    <description>&lt;P&gt;Hello, @Tiago Rente! My name is Piper and I'm a moderator for Databricks. It's great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise, I will follow up shortly with a response.&lt;/P&gt;</description>
    <pubDate>Fri, 08 Oct 2021 20:43:48 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2021-10-08T20:43:48Z</dc:date>
    <item>
      <title>How to implement the where not exists pattern in scala?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13820#M8421</link>
      <description>&lt;P&gt;I have a dataframe with the following columns:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Key1&lt;/LI&gt;&lt;LI&gt;Key2&lt;/LI&gt;&lt;LI&gt;Y_N_Col&lt;/LI&gt;&lt;LI&gt;Col1&lt;/LI&gt;&lt;LI&gt;Col2&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;A given key tuple (Key1, Key2) can have rows with Y_N_Col = "Y" and rows with Y_N_Col = "N".&lt;/P&gt;&lt;P&gt;I need a new dataframe containing all rows with Y_N_Col = "Y" (regardless of the key tuple), plus all rows with Y_N_Col = "N" for which no row with Y_N_Col = "Y" exists for the same key tuple.&lt;/P&gt;&lt;P&gt;The dataframe is already computed in a Scala notebook.&lt;/P&gt;&lt;P&gt;Thanks in advance,&lt;/P&gt;&lt;P&gt;Tiago R.&lt;/P&gt;</description>
      <pubDate>Fri, 08 Oct 2021 17:04:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13820#M8421</guid>
      <dc:creator>tarente</dc:creator>
      <dc:date>2021-10-08T17:04:23Z</dc:date>
    </item>
    <item>
      <title>Re: How to implement the where not exists pattern in scala?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13821#M8422</link>
      <description>&lt;P&gt;Hello, @Tiago Rente! My name is Piper and I'm a moderator for Databricks. It's great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise, I will follow up shortly with a response.&lt;/P&gt;</description>
      <pubDate>Fri, 08 Oct 2021 20:43:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13821#M8422</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-10-08T20:43:48Z</dc:date>
    </item>
    <item>
      <title>Re: How to implement the where not exists pattern in scala?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13822#M8423</link>
      <description>&lt;P&gt;I'd use a left anti join.&lt;/P&gt;&lt;P&gt;Create one df with all the Y rows and another with all the N rows, then left_anti join the N df to the Y df on Key1 and Key2. That keeps only the N rows whose key tuple has no Y row.&lt;/P&gt;&lt;P&gt;Then union the Y df with the result of that join.&lt;/P&gt;</description>
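      <!--
A minimal sketch of the left anti join approach described in this reply, assuming a local SparkSession and a few made-up sample rows. The sample data, the session setup, and the names dfY, dfN, and dfNWithoutY are illustrative assumptions and do not come from the thread; in a Databricks notebook `spark` already exists and dfDups would be the poster's own dataframe.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Local session for illustration only (assumption); a notebook provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("where-not-exists-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows using the column names from the question.
val dfDups: DataFrame = Seq(
  ("a", "1", "Y", "c1", "c2"), // kept: it is a "Y" row
  ("a", "1", "N", "c3", "c4"), // dropped: key (a, 1) also has a "Y" row
  ("b", "2", "N", "c5", "c6")  // kept: key (b, 2) has no "Y" row
).toDF("Key1", "Key2", "Y_N_Col", "Col1", "Col2")

// Split the data by flag.
val dfY = dfDups.filter(col("Y_N_Col") === "Y")
val dfN = dfDups.filter(col("Y_N_Col") === "N")

// left_anti returns only the rows of dfN whose (Key1, Key2) has no match in dfY.
// Only the key columns of dfY are needed on the right side.
val dfNWithoutY = dfN.join(dfY.select("Key1", "Key2"), Seq("Key1", "Key2"), "left_anti")

// Combine all "Y" rows with the surviving "N" rows, matching columns by name.
val dfNoDups = dfY.unionByName(dfNWithoutY)

dfNoDups.show()
```

unionByName is used here rather than union as a precaution: a join on a Seq of column names may place the key columns first in its output, so the column order of the anti join result is only guaranteed to line up with dfY when Key1 and Key2 are already the leading columns.
      -->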
      <pubDate>Mon, 11 Oct 2021 10:21:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13822#M8423</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-11T10:21:51Z</dc:date>
    </item>
    <item>
      <title>Re: How to implement the where not exists pattern in scala?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13823#M8424</link>
      <description>&lt;P&gt;Hi werners,&lt;/P&gt;&lt;P&gt;Thanks for your answer.&lt;/P&gt;&lt;P&gt;I implemented both your suggestion and the solution I was originally looking for, but I am not sure which one performs better.&lt;/P&gt;&lt;P&gt;The solution I was looking for is:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;// My data is stored in dfDups
// Create a Temp View
dfDups
  .createOrReplaceTempView("Dups")
// Create a new df without the "duplicates"
val dfNoDups = sqlContext.sql("""
  select *
    from Dups as Y
   where Y.Y_N_Col = 'Y'
   union all
  select *
    from Dups as N
   where N.Y_N_Col = 'N'
     and not exists (
                     select 1
                       from Dups as Y
                      where Y.Y_N_Col = 'Y'
                        and Y.Key1 = N.Key1
                        and Y.Key2 = N.Key2
                    )
  """)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Tiago R.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 10:34:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13823#M8424</guid>
      <dc:creator>tarente</dc:creator>
      <dc:date>2021-10-12T10:34:00Z</dc:date>
    </item>
    <item>
      <title>Re: How to implement the where not exists pattern in scala?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13824#M8425</link>
      <description>&lt;P&gt;I am not sure. In Spark 2, a where not exists was actually planned as a left_anti join. I don't know whether that has changed in Spark 3.&lt;/P&gt;&lt;P&gt;But you can display the query plan for both solutions and try them both.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 11:08:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13824#M8425</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-12T11:08:32Z</dc:date>
    </item>
    <item>
      <title>Re: How to implement the where not exists pattern in scala?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13825#M8426</link>
      <description>&lt;P&gt;Yes, Spark's Catalyst optimizer is smart. It is possible that both queries will end up with the same plan once the optimizer is done with them. You can get the plan using:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dfNoDups.explain()&lt;/CODE&gt;&lt;/PRE&gt;</description>
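      <!--
A sketch of how the two versions from this thread could be compared with explain, assuming a local SparkSession and made-up sample rows. The data, the session setup, and the names viaSql, viaJoin, sqlPlan, and joinPlan are hypothetical and not part of the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only (assumption); a notebook provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("compare-plans-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows using the column names from the question.
val dfDups = Seq(
  ("a", "1", "Y", "c1", "c2"),
  ("a", "1", "N", "c3", "c4"),
  ("b", "2", "N", "c5", "c6")
).toDF("Key1", "Key2", "Y_N_Col", "Col1", "Col2")
dfDups.createOrReplaceTempView("Dups")

// Version 1: SQL with NOT EXISTS, following the query posted in the thread.
val viaSql = spark.sql("""
  select * from Dups as Y where Y.Y_N_Col = 'Y'
  union all
  select * from Dups as N
   where N.Y_N_Col = 'N'
     and not exists (select 1 from Dups as Y
                      where Y.Y_N_Col = 'Y'
                        and Y.Key1 = N.Key1
                        and Y.Key2 = N.Key2)
""")

// Version 2: DataFrame API with a left_anti join followed by a union.
val dfY = dfDups.filter(col("Y_N_Col") === "Y")
val dfN = dfDups.filter(col("Y_N_Col") === "N")
val viaJoin = dfY.unionByName(
  dfN.join(dfY.select("Key1", "Key2"), Seq("Key1", "Key2"), "left_anti"))

// explain(true) prints the parsed, analyzed, optimized, and physical plans.
viaSql.explain(true)
viaJoin.explain(true)

// The physical plans can also be captured as strings for a side by side diff.
val sqlPlan = viaSql.queryExecution.executedPlan.toString
val joinPlan = viaJoin.queryExecution.executedPlan.toString
```

When reading the output, the join type in the physical plan is the part to look at. Spark's optimizer rewrites a NOT EXISTS subquery into a left anti join, so both plans are expected to show one, although the exact operators printed depend on the Spark version and the size of the data.
      -->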
      <pubDate>Wed, 20 Oct 2021 21:47:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13825#M8426</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-20T21:47:54Z</dc:date>
    </item>
    <item>
      <title>Re: How to implement the where not exists pattern in scala?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13826#M8427</link>
      <description>&lt;P&gt;Thanks for your answer, I did not know about &lt;I&gt;explain&lt;/I&gt;.&lt;/P&gt;&lt;P&gt;I ran some tests and both solutions execute in similar times.&lt;/P&gt;&lt;P&gt;I ended up using the solution suggested by werners, because it will be easier to understand and maintain in the future.&lt;/P&gt;</description>
      <pubDate>Thu, 21 Oct 2021 08:06:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-implement-the-where-not-exists-pattern-in-scala/m-p/13826#M8427</guid>
      <dc:creator>tarente</dc:creator>
      <dc:date>2021-10-21T08:06:41Z</dc:date>
    </item>
  </channel>
</rss>

