I am getting this error when I execute a left join on two DataFrames:
PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last): ...'
The full traceback is posted below.
11-11-2022 05:36 PM
I am simply doing a left join on two DataFrames, and I was able to print the contents of both. Here is what the code looks like:
df_silver = spark.sql("""
    SELECT ds.*
    FROM dfsilver AS ds LEFT JOIN dfaddmaster AS dm
      ON ds.unit = dm.unit AND ds.street = dm.street AND ds.house_number = dm.house_number
""")
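(For reference, the two names in the FROM clause are temporary views registered earlier, roughly like this; the DataFrame variable names here are illustrative:)

df_silver_raw.createOrReplaceTempView("dfsilver")
df_addmaster_raw.createOrReplaceTempView("dfaddmaster")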
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 7.0 failed 4 times, most recent failure: Lost task 4.3 in stage 7.0 (TID 1039) (10.2.43.201 executor 1): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/python/lib/python3.8/site-packages/address_matching/__init__.py", line 5, in <module>
from address_matching.core import *
File "/databricks/python/lib/python3.8/site-packages/address_matching/core.py", line 49, in <module>
SparkSession.builder.appName("AddressMatching")
File "/databricks/spark/python/pyspark/sql/session.py", line 229, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/databricks/spark/python/pyspark/context.py", line 400, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/databricks/spark/python/pyspark/context.py", line 147, in __init__
self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
File "/databricks/spark/python/pyspark/context.py", line 210, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/databricks/spark/python/pyspark/context.py", line 337, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py", line 1568, in __call__
return_value = get_return_value(
File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py", line 334, in get_return_value
raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext'
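Looking at the last frames, the failure does not seem to come from the join itself: when an executor deserializes the task, pickle imports the address_matching package, and that package's core.py builds a SparkSession at module import time (core.py line 49), which is not allowed on a worker and dies with the Py4JError above. A minimal sketch of the pattern the traceback suggests, plus a deferred alternative that only creates the session when called on the driver:

from pyspark.sql import SparkSession

# What core.py appears to do at import time (per the traceback) -- this also
# runs on executors while unpickling, where no SparkContext can be created:
#     spark = SparkSession.builder.appName("AddressMatching").getOrCreate()

# Deferred alternative: create or reuse the session only when actually called.
def get_spark():
    return SparkSession.builder.appName("AddressMatching").getOrCreate()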
11-11-2022 05:37 PM
Databricks Runtime Version: 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)
11-11-2022 05:40 PM
I am using a multi-node cluster.
11-13-2022 04:12 PM
I can't tell you exactly without looking at the schemas of the two DataFrames you're joining, but since it's throwing a SerializationError, it's potentially something to do with the data types. It's worth comparing the join keys on both sides, as in the sketch below.
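For example, assuming the two DataFrames are bound to Python variables named dfsilver and dfaddmaster, a quick way to compare the types of the three join keys:

# Print the data type of each join key on both sides of the join.
silver_types = dict(dfsilver.dtypes)
addmaster_types = dict(dfaddmaster.dtypes)
for key in ["unit", "street", "house_number"]:
    print(key, silver_types.get(key), addmaster_types.get(key))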
11-15-2022 09:10 AM
First, can you double-check that they are PySpark DataFrames?
from pyspark.sql import DataFrame
# This should print True for each of the two DataFrames.
print(isinstance(df_name, DataFrame))
Next, for DataFrame joins I would use the PySpark join function that is meant for PySpark DataFrames. If I were going to use Spark SQL, I would run it against a Delta table. You can save those DataFrames as Delta tables and try your query again against the table names (a sketch of that is below), or you can try the PySpark left-join code that follows.
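Here is a sketch of the Delta-table route; the table names are illustrative and assume you can write to the metastore:

# Persist both DataFrames as Delta tables, then query the saved tables.
dfsilver.write.format("delta").mode("overwrite").saveAsTable("silver_tbl")
dfaddmaster.write.format("delta").mode("overwrite").saveAsTable("addmaster_tbl")
df_joined = spark.sql("""
    SELECT ds.*
    FROM silver_tbl AS ds LEFT JOIN addmaster_tbl AS dm
      ON ds.unit = dm.unit AND ds.street = dm.street AND ds.house_number = dm.house_number
""")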
And the equivalent PySpark left join:
dfsilver.join(dfaddmaster, (dfsilver.unit == dfaddmaster.unit) & (dfsilver.street == dfaddmaster.street) & (dfsilver.house_number == dfaddmaster.house_number), "left").show(truncate=False)
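One thing to watch in that condition: the comparisons must be combined with & rather than Python's and, and each comparison needs its own parentheses, since using and on Column objects raises ValueError: Cannot convert column into bool.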
11-18-2022 08:39 AM
Did that answer your question? Did it work?

