Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

I am getting this error when I execute a left join on two DataFrames: PythonException: 'pyspark.serializers.SerializationError'. Full traceback posted below:

693872
New Contributor II

I simply do a left join on two DataFrames, and I was able to print the contents of both.

Here is what the code looks like:

df_silver = spark.sql("""
    select ds.PropertyID, ds.*
    from dfsilver as ds
    left join dfaddmaster as dm
      on  ds.unit         = dm.unit
      and ds.street       = dm.street
      and ds.house_number = dm.house_number
""")

org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 7.0 failed 4 times, most recent failure: Lost task 4.3 in stage 7.0 (TID 1039) (10.2.43.201 executor 1): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/databricks/python/lib/python3.8/site-packages/address_matching/__init__.py", line 5, in <module>
    from address_matching.core import *
  File "/databricks/python/lib/python3.8/site-packages/address_matching/core.py", line 49, in <module>
    SparkSession.builder.appName("AddressMatching")
  File "/databricks/spark/python/pyspark/sql/session.py", line 229, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/databricks/spark/python/pyspark/context.py", line 400, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/databricks/spark/python/pyspark/context.py", line 147, in __init__
    self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
  File "/databricks/spark/python/pyspark/context.py", line 210, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/databricks/spark/python/pyspark/context.py", line 337, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py", line 1568, in __call__
    return_value = get_return_value(
  File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py", line 334, in get_return_value
    raise Py4JError(

py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext'

5 REPLIES

693872
New Contributor II

Databricks Runtime Version

10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)

693872
New Contributor II

I am using a multi-node cluster.

PriyaAnanthram
Contributor III

I can't tell you exactly without looking at the schemas of the two DataFrames you're joining, but since it's throwing a SerializationError, it's potentially something to do with data types.
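One way to run the data-type check suggested above (a sketch; the helper `mismatched_join_keys` and the sample schemas are made up for illustration, not from the thread): in PySpark, `df.dtypes` returns a list of `(column, type)` pairs, which you can diff on the join keys:

```python
def mismatched_join_keys(left_dtypes, right_dtypes, keys):
    """Return join keys whose Spark SQL types differ between the two sides.

    left_dtypes / right_dtypes are lists of (column, type) pairs,
    exactly the shape PySpark's DataFrame.dtypes returns.
    """
    left = dict(left_dtypes)
    right = dict(right_dtypes)
    return [(k, left.get(k), right.get(k))
            for k in keys
            if left.get(k) != right.get(k)]

# On the cluster you would pass the real schemas:
#   mismatched_join_keys(dfsilver.dtypes, dfaddmaster.dtypes,
#                        ["unit", "street", "house_number"])

# Example with hypothetical schemas:
silver = [("unit", "string"), ("street", "string"), ("house_number", "int")]
master = [("unit", "string"), ("street", "string"), ("house_number", "string")]
print(mismatched_join_keys(silver, master, ["unit", "street", "house_number"]))
# → [('house_number', 'int', 'string')]
```

Any key it reports is a candidate for an explicit cast before the join.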

Dooley
Valued Contributor

First, can you double-check that they are PySpark DataFrames?

from pyspark.sql import DataFrame
print(isinstance(df_name, DataFrame))

Next, for DataFrame joins I would use the PySpark join function meant for PySpark DataFrames. If I were going to use Spark SQL, I would do that on a Delta table. You can save those DataFrames as Delta tables and try your code again on the table names, or you can try the PySpark left-join code.

# Note: combine Column conditions with & (Python's `and` raises an error on Columns).
dfsilver.join(
    dfaddmaster,
    (dfsilver.unit == dfaddmaster.unit)
    & (dfsilver.street == dfaddmaster.street)
    & (dfsilver.house_number == dfaddmaster.house_number),
    "left",
).show(truncate=False)
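The Delta-table route mentioned above could look like this (a sketch only; the table names `silver_tbl` and `addmaster_tbl` are made up for illustration, and the `spark` calls run only on a Databricks/Spark cluster):

```python
# Spark SQL join to run once both DataFrames are saved as Delta tables.
join_sql = """
    SELECT ds.*
    FROM silver_tbl AS ds
    LEFT JOIN addmaster_tbl AS dm
      ON  ds.unit         = dm.unit
      AND ds.street       = dm.street
      AND ds.house_number = dm.house_number
"""

# On the cluster:
#   dfsilver.write.format("delta").saveAsTable("silver_tbl")
#   dfaddmaster.write.format("delta").saveAsTable("addmaster_tbl")
#   df_silver = spark.sql(join_sql)
```

Querying registered tables instead of raw DataFrames also makes it easier to inspect each side independently in SQL while debugging.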

Dooley
Valued Contributor

Did that answer your question? Did it work?
