<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Here I am getting this error when i execute left join on two data frame:

PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last): going to post full traceback: in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22664#M15555</link>
    <description>&lt;P&gt;I am using a multi-node cluster.&lt;/P&gt;</description>
    <pubDate>Sat, 12 Nov 2022 01:40:16 GMT</pubDate>
    <dc:creator>693872</dc:creator>
    <dc:date>2022-11-12T01:40:16Z</dc:date>
    <item>
      <title>Here I am getting this error when i execute left join on two data frame:

PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last): going to post full traceback:</title>
      <link>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22662#M15553</link>
      <description>&lt;P&gt;I am doing a simple left join on two DataFrames, and I was able to print the contents of both.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here is what the code looks like:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;df_silver = spark.sql("select ds.PropertyID,\&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;ds.* \&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;from dfsilver as ds LEFT JOIN dfaddmaster as dm \&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;ON ds.unit = dm.unit and ds.street = dm.street and ds.house_number = dm.house_number")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 7.0 failed 4 times, most recent failure: Lost task 4.3 in stage 7.0 (TID 1039) (10.2.43.201 executor 1): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length&lt;/P&gt;&lt;P&gt;    return self.loads(obj)&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads&lt;/P&gt;&lt;P&gt;    return pickle.loads(obj, encoding=encoding)&lt;/P&gt;&lt;P&gt;  File "/databricks/python/lib/python3.8/site-packages/address_matching/__init__.py", line 5, in &amp;lt;module&amp;gt;&lt;/P&gt;&lt;P&gt;    from address_matching.core import *&lt;/P&gt;&lt;P&gt;  File "/databricks/python/lib/python3.8/site-packages/address_matching/core.py", line 49, in &amp;lt;module&amp;gt;&lt;/P&gt;&lt;P&gt;    SparkSession.builder.appName("AddressMatching")&lt;/P&gt;&lt;P&gt;  File 
"/databricks/spark/python/pyspark/sql/session.py", line 229, in getOrCreate&lt;/P&gt;&lt;P&gt;    sc = SparkContext.getOrCreate(sparkConf)&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/context.py", line 400, in getOrCreate&lt;/P&gt;&lt;P&gt;    SparkContext(conf=conf or SparkConf())&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/context.py", line 147, in __init__&lt;/P&gt;&lt;P&gt;    self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/context.py", line 210, in _do_init&lt;/P&gt;&lt;P&gt;    self._jsc = jsc or self._initialize_context(self._conf._jconf)&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/context.py", line 337, in _initialize_context&lt;/P&gt;&lt;P&gt;    return self._jvm.JavaSparkContext(jconf)&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py", line 1568, in __call__&lt;/P&gt;&lt;P&gt;    return_value = get_return_value(&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py", line 334, in get_return_value&lt;/P&gt;&lt;P&gt;    raise Py4JError(&lt;/P&gt;&lt;P&gt;py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext'. 
Full traceback below:&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length&lt;/P&gt;&lt;P&gt;    return self.loads(obj)&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads&lt;/P&gt;&lt;P&gt;    return pickle.loads(obj, encoding=encoding)&lt;/P&gt;&lt;P&gt;  File "/databricks/python/lib/python3.8/site-packages/address_matching/__init__.py", line 5, in &amp;lt;module&amp;gt;&lt;/P&gt;&lt;P&gt;    from address_matching.core import *&lt;/P&gt;&lt;P&gt;  File "/databricks/python/lib/python3.8/site-packages/address_matching/core.py", line 49, in &amp;lt;module&amp;gt;&lt;/P&gt;&lt;P&gt;    SparkSession.builder.appName("AddressMatching")&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/sql/session.py", line 229, in getOrCreate&lt;/P&gt;&lt;P&gt;    sc = SparkContext.getOrCreate(sparkConf)&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/context.py", line 400, in getOrCreate&lt;/P&gt;&lt;P&gt;    SparkContext(conf=conf or SparkConf())&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/context.py", line 147, in __init__&lt;/P&gt;&lt;P&gt;    self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/context.py", line 210, in _do_init&lt;/P&gt;&lt;P&gt;    self._jsc = jsc or self._initialize_context(self._conf._jconf)&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/pyspark/context.py", line 337, in _initialize_context&lt;/P&gt;&lt;P&gt;    return self._jvm.JavaSparkContext(jconf)&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py", line 1568, in __call__&lt;/P&gt;&lt;P&gt;    return_value = get_return_value(&lt;/P&gt;&lt;P&gt;  File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py", line 334, in get_return_value&lt;/P&gt;&lt;P&gt;    raise 
Py4JError(&lt;/P&gt;&lt;P&gt;py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext&lt;/P&gt;</description>
      <pubDate>Sat, 12 Nov 2022 01:36:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22662#M15553</guid>
      <dc:creator>693872</dc:creator>
      <dc:date>2022-11-12T01:36:38Z</dc:date>
    </item>
    <item>
      <title>Re: Here I am getting this error when i execute left join on two data frame:

PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last): going to post full traceback:</title>
      <link>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22663#M15554</link>
      <description>&lt;P&gt;Databricks Runtime version: 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)&lt;/P&gt;</description>
      <pubDate>Sat, 12 Nov 2022 01:37:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22663#M15554</guid>
      <dc:creator>693872</dc:creator>
      <dc:date>2022-11-12T01:37:47Z</dc:date>
    </item>
    <item>
      <title>Re: Here I am getting this error when i execute left join on two data frame:

PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last): going to post full traceback:</title>
      <link>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22664#M15555</link>
      <description>&lt;P&gt;I am using a multi-node cluster.&lt;/P&gt;</description>
      <pubDate>Sat, 12 Nov 2022 01:40:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22664#M15555</guid>
      <dc:creator>693872</dc:creator>
      <dc:date>2022-11-12T01:40:16Z</dc:date>
    </item>
    <item>
      <title>Re: Here I am getting this error when i execute left join on two data frame:

PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last): going to post full traceback:</title>
      <link>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22665#M15556</link>
      <description>&lt;P&gt;I can't tell you exactly without looking at the schemas of the two DataFrames you're joining, but since it's throwing a SerializationError, it potentially has something to do with the data types.&lt;/P&gt;</description>
      <pubDate>Mon, 14 Nov 2022 00:12:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22665#M15556</guid>
      <dc:creator>PriyaAnanthram</dc:creator>
      <dc:date>2022-11-14T00:12:59Z</dc:date>
    </item>
    <item>
      <title>Re: Here I am getting this error when i execute left join on two data frame:

PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last): going to post full traceback:</title>
      <link>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22666#M15557</link>
      <description>&lt;P&gt;First, can you double-check that they are PySpark DataFrames?&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import DataFrame
# df_name is a placeholder; run this check for each of your two DataFrames
print(isinstance(df_name, DataFrame))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Next, for DataFrame joins I would use the PySpark join function meant for PySpark DataFrames. If I were going to use Spark SQL, I would do that on a Delta table. You can save those DataFrames as Delta tables and try your code again on the table names, or you can try the PySpark left join code.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# Column conditions must be combined with &amp;amp; (Python's "and" raises an error on Column objects)
dfsilver.join(dfaddmaster, (dfsilver.unit == dfaddmaster.unit) &amp;amp; (dfsilver.street == dfaddmaster.street) &amp;amp; (dfsilver.house_number == dfaddmaster.house_number), "left").show(truncate=False)&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 15 Nov 2022 17:10:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22666#M15557</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-11-15T17:10:27Z</dc:date>
    </item>
    <item>
      <title>Re: Here I am getting this error when i execute left join on two data frame:

PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last): going to post full traceback:</title>
      <link>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22667#M15558</link>
      <description>&lt;P&gt;Did that answer your question? Did it work?&lt;/P&gt;</description>
      <pubDate>Fri, 18 Nov 2022 16:39:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/here-i-am-getting-this-error-when-i-execute-left-join-on-two/m-p/22667#M15558</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-11-18T16:39:41Z</dc:date>
    </item>
  </channel>
</rss>

