Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

I am getting this error when I execute a left join on two DataFrames: PythonException: 'pyspark.serializers.SerializationError'. Full traceback posted below:

693872
New Contributor II

I simply do a left join on two DataFrames, and I was able to print the contents of both.

Here is what the code looks like:

df_silver = spark.sql("""
    select ds.PropertyID, ds.*
    from dfsilver as ds
    left join dfaddmaster as dm
      on  ds.unit         = dm.unit
      and ds.street       = dm.street
      and ds.house_number = dm.house_number
""")

org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 7.0 failed 4 times, most recent failure: Lost task 4.3 in stage 7.0 (TID 1039) (10.2.43.201 executor 1): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/databricks/python/lib/python3.8/site-packages/address_matching/__init__.py", line 5, in <module>
    from address_matching.core import *
  File "/databricks/python/lib/python3.8/site-packages/address_matching/core.py", line 49, in <module>
    SparkSession.builder.appName("AddressMatching")
  File "/databricks/spark/python/pyspark/sql/session.py", line 229, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/databricks/spark/python/pyspark/context.py", line 400, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/databricks/spark/python/pyspark/context.py", line 147, in __init__
    self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
  File "/databricks/spark/python/pyspark/context.py", line 210, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/databricks/spark/python/pyspark/context.py", line 337, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py", line 1568, in __call__
    return_value = get_return_value(
  File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py", line 334, in get_return_value
    raise Py4JError(

py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext'

5 REPLIES

693872
New Contributor II

Databricks Runtime Version

10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)

693872
New Contributor II

I am using a multi-node cluster.

PriyaAnanthram
Contributor III

I can't tell you exactly without looking at the schemas of the two DataFrames you're joining, but since it's throwing a SerializationError, it's potentially something to do with data types.
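One way to run the data-type check suggested above (a sketch; the helper `mismatched_join_keys` and the sample schemas are made up for illustration, not from the thread): in PySpark, `df.dtypes` returns a list of `(column, type)` pairs, which you can diff on the join keys:

```python
def mismatched_join_keys(left_dtypes, right_dtypes, keys):
    """Return join keys whose Spark SQL types differ between the two sides.

    left_dtypes / right_dtypes are lists of (column, type) pairs,
    exactly the shape PySpark's DataFrame.dtypes returns.
    """
    left = dict(left_dtypes)
    right = dict(right_dtypes)
    return [(k, left.get(k), right.get(k))
            for k in keys
            if left.get(k) != right.get(k)]

# On the cluster you would pass the real schemas:
#   mismatched_join_keys(dfsilver.dtypes, dfaddmaster.dtypes,
#                        ["unit", "street", "house_number"])

# Example with hypothetical schemas:
silver = [("unit", "string"), ("street", "string"), ("house_number", "int")]
master = [("unit", "string"), ("street", "string"), ("house_number", "string")]
print(mismatched_join_keys(silver, master, ["unit", "street", "house_number"]))
# → [('house_number', 'int', 'string')]
```

Any key it reports is a candidate for an explicit cast before the join.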

Dooley
Valued Contributor

First, can you double-check that they are PySpark DataFrames?

from pyspark.sql import DataFrame
print(isinstance(df_name, DataFrame))

Next, for DataFrame joins I would use the PySpark join function meant for PySpark DataFrames. If I were going to use Spark SQL, I would do that on a Delta table. You can save those DataFrames as Delta tables and try your code again on the table names, or you can try the PySpark left-join code.

# Note: combine Column conditions with & (Python's `and` raises an error on Columns).
dfsilver.join(
    dfaddmaster,
    (dfsilver.unit == dfaddmaster.unit)
    & (dfsilver.street == dfaddmaster.street)
    & (dfsilver.house_number == dfaddmaster.house_number),
    "left",
).show(truncate=False)
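The Delta-table route mentioned above could look like this (a sketch only; the table names `silver_tbl` and `addmaster_tbl` are made up for illustration, and the `spark` calls run only on a Databricks/Spark cluster):

```python
# Spark SQL join to run once both DataFrames are saved as Delta tables.
join_sql = """
    SELECT ds.*
    FROM silver_tbl AS ds
    LEFT JOIN addmaster_tbl AS dm
      ON  ds.unit         = dm.unit
      AND ds.street       = dm.street
      AND ds.house_number = dm.house_number
"""

# On the cluster:
#   dfsilver.write.format("delta").saveAsTable("silver_tbl")
#   dfaddmaster.write.format("delta").saveAsTable("addmaster_tbl")
#   df_silver = spark.sql(join_sql)
```

Querying registered tables instead of raw DataFrames also makes it easier to inspect each side independently in SQL while debugging.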

Dooley
Valued Contributor

Did that answer your question? Did it work?
