<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Spark is not able to resolve the columns correctly when joins data frames in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-is-not-able-to-resolve-the-columns-correctly-when-joins/m-p/30824#M22389</link>
    <description>&lt;P&gt;In my opinion, the problem is in the select, not the join. Please split your code into two steps (join, then select).&lt;/P&gt;&lt;P&gt;After the join, please verify the schema using next_df.schema or next_df.printSchema().&lt;/P&gt;&lt;P&gt;Please verify the column names.&lt;/P&gt;&lt;P&gt;If you don't find the issue, please share the schemas of days_currencies_matrix, data_to_merge, and next_df here, and I will try to help.&lt;/P&gt;</description>
    <pubDate>Thu, 27 Jan 2022 12:28:36 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2022-01-27T12:28:36Z</dc:date>
    <item>
      <title>Spark is not able to resolve the columns correctly when joins data frames</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-is-not-able-to-resolve-the-columns-correctly-when-joins/m-p/30823#M22388</link>
      <description>&lt;P&gt;Hello all,&lt;/P&gt;&lt;P&gt;I'm using PySpark (Python 3.8) with Spark 3.0 on Databricks. When running this DataFrame join:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;next_df = days_currencies_matrix.alias('a').join( data_to_merge.alias('b') , [ 
   days_currencies_matrix.dt == data_to_merge.RATE_DATE, 
   days_currencies_matrix.CURRENCY_CODE == data_to_merge.CURRENCY_CODE ], 'LEFT').\
   select( 
         days_currencies_matrix.CURRENCY_CODE
        ,days_currencies_matrix.dt.alias('RATE_DATE')
        ,data_to_merge.AVGYTD
        ,data_to_merge.ENDMTH
        ,data_to_merge.AVGMTH
        ,data_to_merge.AVGWEEK
        ,data_to_merge.AVGMTD
    )&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I get this error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Column AVGYTD#67187, AVGWEEK#67190, ENDMTH#67188, AVGMTH#67189, AVGMTD#67191 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" &amp;gt; $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This tells me that the columns above belong to more than one dataset.&lt;/P&gt;&lt;P&gt;Why is that happening? The code tells Spark exactly which source DataFrame each column comes from; also, days_currencies_matrix has only two columns: dt and CURRENCY_CODE.&lt;/P&gt;&lt;P&gt;Is it because the days_currencies_matrix DataFrame is actually built from data_to_merge? Is this related to lazy evaluation, or is it a bug?&lt;/P&gt;&lt;P&gt;BTW, this version works with no issues:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 27 Jan 2022 10:23:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-is-not-able-to-resolve-the-columns-correctly-when-joins/m-p/30823#M22388</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-27T10:23:31Z</dc:date>
    </item>
    <item>
      <title>Re: Spark is not able to resolve the columns correctly when joins data frames</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-is-not-able-to-resolve-the-columns-correctly-when-joins/m-p/30824#M22389</link>
      <description>&lt;P&gt;In my opinion, the problem is in the select, not the join. Please split your code into two steps (join, then select).&lt;/P&gt;&lt;P&gt;After the join, please verify the schema using next_df.schema or next_df.printSchema().&lt;/P&gt;&lt;P&gt;Please verify the column names.&lt;/P&gt;&lt;P&gt;If you don't find the issue, please share the schemas of days_currencies_matrix, data_to_merge, and next_df here, and I will try to help.&lt;/P&gt;</description>
      <pubDate>Thu, 27 Jan 2022 12:28:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-is-not-able-to-resolve-the-columns-correctly-when-joins/m-p/30824#M22389</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-01-27T12:28:36Z</dc:date>
    </item>
    <item>
      <title>Re: Spark is not able to resolve the columns correctly when joins data frames</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-is-not-able-to-resolve-the-columns-correctly-when-joins/m-p/30825#M22390</link>
      <description>&lt;P&gt;OK, I found the problem.&lt;/P&gt;&lt;P&gt;The select() operates on the columns of next_df, and I was referencing them the wrong way (through the wrong dataset name).&lt;/P&gt;</description>
      <pubDate>Thu, 27 Jan 2022 15:06:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-is-not-able-to-resolve-the-columns-correctly-when-joins/m-p/30825#M22390</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-27T15:06:06Z</dc:date>
    </item>
    <item>
      <title>Re: Spark is not able to resolve the columns correctly when joins data frames</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-is-not-able-to-resolve-the-columns-correctly-when-joins/m-p/30826#M22391</link>
      <description>&lt;P&gt;@Alessio Palma - Howdy! My name is Piper, and I'm a moderator for the community. Would you be happy to mark whichever answer solved your issue so other members may find the solution more quickly?&lt;/P&gt;</description>
      <pubDate>Thu, 27 Jan 2022 17:29:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-is-not-able-to-resolve-the-columns-correctly-when-joins/m-p/30826#M22391</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-27T17:29:44Z</dc:date>
    </item>
    <item>
      <title>Re: Spark is not able to resolve the columns correctly when joins data frames</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-is-not-able-to-resolve-the-columns-correctly-when-joins/m-p/30827#M22392</link>
      <description>&lt;P&gt;If this is only about "Selected as Best", I did that today.&lt;/P&gt;</description>
      <pubDate>Fri, 28 Jan 2022 08:27:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-is-not-able-to-resolve-the-columns-correctly-when-joins/m-p/30827#M22392</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-28T08:27:45Z</dc:date>
    </item>
  </channel>
</rss>

