11-11-2022 05:36 PM
I simply do a left join on two DataFrames, and I was able to print the contents of both DataFrames beforehand. Here is what the code looks like:
df_silver = spark.sql("select ds.PropertyID, \
    ds.* \
    from dfsilver as ds LEFT JOIN dfaddmaster as dm \
    ON ds.unit = dm.unit and ds.street = dm.street and ds.house_number = dm.house_number")
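For spark.sql to resolve dfsilver and dfaddmaster, both DataFrames were registered as temporary views first, roughly like this (the variable names here are illustrative):
# Variable names are hypothetical; substitute the actual DataFrame variables.
silver_df.createOrReplaceTempView("dfsilver")
addmaster_df.createOrReplaceTempView("dfaddmaster")
When the job runs, it fails with: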
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 7.0 failed 4 times, most recent failure: Lost task 4.3 in stage 7.0 (TID 1039) (10.2.43.201 executor 1): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/databricks/python/lib/python3.8/site-packages/address_matching/__init__.py", line 5, in <module>
    from address_matching.core import *
  File "/databricks/python/lib/python3.8/site-packages/address_matching/core.py", line 49, in <module>
    SparkSession.builder.appName("AddressMatching")
  File "/databricks/spark/python/pyspark/sql/session.py", line 229, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/databricks/spark/python/pyspark/context.py", line 400, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/databricks/spark/python/pyspark/context.py", line 147, in __init__
    self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
  File "/databricks/spark/python/pyspark/context.py", line 210, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/databricks/spark/python/pyspark/context.py", line 337, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py", line 1568, in __call__
    return_value = get_return_value(
  File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py", line 334, in get_return_value
    raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext'
11-11-2022 05:37 PM
Databricks Runtime Version
10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)
11-11-2022 05:40 PM
I am using a multi-node cluster.
11-13-2022 04:12 PM
I can't tell you exactly without looking at the schemas of the two DataFrames you're joining, but since it's throwing a SerializationError, it's potentially something to do with the data types.
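You could start by comparing the types of the join keys on both sides, for example:
# Print just the join-key columns to compare their data types
dfsilver.select("unit", "street", "house_number").printSchema()
dfaddmaster.select("unit", "street", "house_number").printSchema()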
11-15-2022 09:10 AM
First, can you double-check that they are PySpark DataFrames?
from pyspark.sql import DataFrame
# Should print True for each of the two DataFrames
print(isinstance(df_name, DataFrame))
Next, for DataFrame joins I would use the PySpark join function meant for PySpark DataFrames. If I were going to use Spark SQL, I would do that on a Delta table: you can save those DataFrames as Delta tables and rerun your query against the table names (see the sketch below), or you can try the PySpark left-join code that follows it.
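A minimal sketch of the Delta-table route; the table names here are placeholders:
# Save both DataFrames as Delta tables (table names are illustrative).
dfsilver.write.format("delta").mode("overwrite").saveAsTable("silver_tbl")
dfaddmaster.write.format("delta").mode("overwrite").saveAsTable("addmaster_tbl")
# Then run the same left join against the registered tables.
df_joined = spark.sql("SELECT ds.* FROM silver_tbl AS ds LEFT JOIN addmaster_tbl AS dm ON ds.unit = dm.unit AND ds.street = dm.street AND ds.house_number = dm.house_number")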
And the PySpark version; note the & operator rather than and, since Python's and forces boolean evaluation of Column objects and raises an error:
dfsilver.join(
    dfaddmaster,
    (dfsilver.unit == dfaddmaster.unit)
    & (dfsilver.street == dfaddmaster.street)
    & (dfsilver.house_number == dfaddmaster.house_number),
    "left",
).show(truncate=False)
11-18-2022 08:39 AM
Did that answer your question? Did it work?