Databricks Community

daniil_terentye · ‎07-19-2021

Hi everybody.

Looks like EXISTS statement works incorrectly.

If i execute the following statement in SQL Server it returns one row, as it should

WITH a AS (
    SELECT '1' AS id, 'Super Company' AS name
    UNION
    SELECT '2' AS id, 'SUPER COMPANY' AS name
),
b AS (
    SELECT 'a@b.com' AS user_username, 'Super Company' AS user_company
)
SELECT
    *
FROM
    b
WHERE
    NOT EXISTS (
        SELECT
          1
        FROM
          a
        WHERE
          LOWER(a.Name) = LOWER(b.user_company)
        GROUP BY
          LOWER(a.Name)
        HAVING
            COUNT(DISTINCT a.Id) = 1
    )

But if i execute the same statement in Databricks notebook it returns nothing, that is absolutely incorrect.

Cluster version is 6.4 Extended Support (includes Apache Spark 2.4.5, Scala 2.11)

Could somebody please look into this.

I know how to rewrite this statements using joins and so on. But, may be, it's possible to make EXISTS statement work in a correct way?

Thanks in advance.

Regards,

Daniil.

daniil_terentye · ‎07-19-2021

I ended up with the following

Spark - Mixed case sensitivity in Spark DataFrame, Spark SQL, and/or Databricks Table - Stack Overflow

:(((((

It works, but it's extra-ugly.

Does it works better in spark 3.x?

daniil_terentye · ‎07-20-2021

No.

:(((

Behavior of Spark 3.1.1 is the same

daniil_terentye · ‎07-21-2021

In newer versions of spark it's possible to use ANTI JOIN and SEMI JOIN

It looks this way:

WITH a AS ( SELECT '1' AS id, 'Super Company' AS name UNION SELECT '2' AS id, 'SUPER COMPANY' AS name ), b AS ( SELECT 'a@b.com' AS user_username, 'Super Company' AS user_company ) SELECT * FROM b ANTI JOIN ( SELECT LOWER(a.Name) as lower_name FROM a GROUP BY LOWER(a.Name) HAVING COUNT(DISTINCT a.Id) = 1 ) r ON lower(b.user_company) = r.lower_name

Databricks Community

EXISTS statement works incorrectly

Databricks AMER Learning Festival | Virtual Training

Introducing the Genie Hub: Ask Questions, Share Builds, and Master Conversational Analytics

🌟 Community Pulse: Your Weekly Roundup! July 13 – 19, 2026

Solution Accelerator Series | Social Determinants of Health

Upcoming Community BrickTalk | Sports Analytics: Turning Tracking Data into Real-Time AI Decisions