Databricks Community

TrinaDe · ‎07-15-2021

My two dataframes look like new_df2_record1 and new_df2_record2 and the expected output dataframe I want is like new_df2:

The code I have tried is the following:

If I print the top 5 rows of new_df2, it gives the expected output, but I cannot print the total row count or the number of columns it contains. It gives the error:

"ERROR Executor: Exception in task 2.0 in stage 6.0 (TID 8)

org.apache.spark.api.python.PythonException: Traceback (most recent call last):

File "D:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 604, in main

File "D:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 596, in process

File "D:\Spark\python\lib\pyspark.zip\pyspark\serializers.py", line 259, in dump_stream

vs = list(itertools.islice(iterator, batch))

File "D:\Spark\python\lib\pyspark.zip\pyspark\serializers.py", line 326, in _load_stream_without_unbatching

" in batches: (%d, %d)" % (len(key_batch), len(val_batch)))

ValueError: Can not deserialize PairRDD with different number of items in batches: (4096, 8192)"

from pyspark.sql.types import StructType

new_df2_record2 = new_df2_record2.drop('record1', 'record2')
schema = StructType(new_df2_record1.schema.fields + new_df2_record2.schema.fields)
df1df2 = new_df2_record1.rdd.zip(new_df2_record2.rdd).map(lambda x: x[0] + x[1])
new_df2 = spark.createDataFrame(df1df2, schema)

new_df2.show(5)
print(new_df2.count(), len(new_df2.columns))
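For anyone trying to read the error: the message suggests that the two zipped RDDs were serialized in batches of different sizes (4096 rows on one side, 8192 on the other). `RDD.zip` is documented to assume the same number of partitions and the same number of elements in each partition on both sides. The sketch below is plain Python, not PySpark's actual code. The helper names `batches` and `zip_batched` are made up for illustration, and the sketch only imitates a batch-by-batch length check to show how mismatched batch sizes could produce this exact message.

```python
from itertools import islice


def batches(rows, batch_size):
    """Yield successive lists of at most batch_size rows (hypothetical helper)."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def zip_batched(left_batches, right_batches):
    """Pair rows batch by batch; fail when two batches differ in length.

    A simplified stand-in for the kind of check that raises the error
    in the traceback, written for illustration only.
    """
    for left, right in zip(left_batches, right_batches):
        if len(left) != len(right):
            raise ValueError(
                "Can not deserialize PairRDD with different number of items"
                " in batches: (%d, %d)" % (len(left), len(right))
            )
        yield from zip(left, right)


rows = list(range(10000))

# Same batch size on both sides: every row gets paired.
same = list(zip_batched(batches(rows, 4096), batches(rows, 4096)))

# Different batch sizes: the first pair of batches already disagrees.
try:
    list(zip_batched(batches(rows, 4096), batches(rows, 8192)))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

This may also explain why `show(5)` works while `count()` fails: `show(5)` only needs a few rows and may never reach a mismatched batch, whereas `count()` has to read everything. A common way to avoid depending on batch layout is to give each side an explicit row index (for example with `rdd.zipWithIndex()`) and join on that index instead of calling `zip`.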