<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How can we join two pyspark dataframes side by side (without using join, equivalent to pd.concat() in pandas)? I am trying to join two extremely large dataframes where each is of the order of 50 million rows. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-can-we-join-two-pyspark-dataframes-side-by-side-without/m-p/17662#M11630</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;My two dataframes look like new_df2_record1 and new_df2_record2 and the expected output dataframe I want is like new_df2:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0693f000007OoS6AAK"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2475i7E1EA18673C524D2/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693f000007OoS6AAK" alt="0693f000007OoS6AAK" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;The code I have tried is the following:&lt;/P&gt;
&lt;PRE&gt;from pyspark.sql.types import StructType

new_df2_record2 = new_df2_record2.drop('record1', 'record2')
schema = StructType(new_df2_record1.schema.fields + new_df2_record2.schema.fields)
df1df2 = new_df2_record1.rdd.zip(new_df2_record2.rdd).map(lambda x: x[0] + x[1])
new_df2 = spark.createDataFrame(df1df2, schema)

new_df2.show(5)
print(new_df2.count(), len(new_df2.columns))&lt;/PRE&gt;
&lt;P&gt;If I print the top 5 rows of new_df2, it gives the expected output, but I cannot print the total row count or the total number of columns. It fails with the following error:&lt;/P&gt;
&lt;PRE&gt;ERROR Executor: Exception in task 2.0 in stage 6.0 (TID 8)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "D:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 604, in main
  File "D:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 596, in process
  File "D:\Spark\python\lib\pyspark.zip\pyspark\serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "D:\Spark\python\lib\pyspark.zip\pyspark\serializers.py", line 326, in _load_stream_without_unbatching
    " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
ValueError: Can not deserialize PairRDD with different number of items in batches: (4096, 8192)&lt;/PRE&gt;</description>
    <pubDate>Thu, 15 Jul 2021 15:11:23 GMT</pubDate>
    <dc:creator>TrinaDe</dc:creator>
    <dc:date>2021-07-15T15:11:23Z</dc:date>
    <item>
      <title>How can we join two pyspark dataframes side by side (without using join, equivalent to pd.concat() in pandas)? I am trying to join two extremely large dataframes where each is of the order of 50 million rows.</title>
      <link>https://community.databricks.com/t5/data-engineering/how-can-we-join-two-pyspark-dataframes-side-by-side-without/m-p/17662#M11630</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;My two dataframes look like new_df2_record1 and new_df2_record2 and the expected output dataframe I want is like new_df2:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0693f000007OoS6AAK"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2475i7E1EA18673C524D2/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693f000007OoS6AAK" alt="0693f000007OoS6AAK" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;The code I have tried is the following:&lt;/P&gt;
&lt;PRE&gt;from pyspark.sql.types import StructType

new_df2_record2 = new_df2_record2.drop('record1', 'record2')
schema = StructType(new_df2_record1.schema.fields + new_df2_record2.schema.fields)
df1df2 = new_df2_record1.rdd.zip(new_df2_record2.rdd).map(lambda x: x[0] + x[1])
new_df2 = spark.createDataFrame(df1df2, schema)

new_df2.show(5)
print(new_df2.count(), len(new_df2.columns))&lt;/PRE&gt;
&lt;P&gt;If I print the top 5 rows of new_df2, it gives the expected output, but I cannot print the total row count or the total number of columns. It fails with the following error:&lt;/P&gt;
&lt;PRE&gt;ERROR Executor: Exception in task 2.0 in stage 6.0 (TID 8)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "D:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 604, in main
  File "D:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 596, in process
  File "D:\Spark\python\lib\pyspark.zip\pyspark\serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "D:\Spark\python\lib\pyspark.zip\pyspark\serializers.py", line 326, in _load_stream_without_unbatching
    " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
ValueError: Can not deserialize PairRDD with different number of items in batches: (4096, 8192)&lt;/PRE&gt;</description>
      <pubDate>Thu, 15 Jul 2021 15:11:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-can-we-join-two-pyspark-dataframes-side-by-side-without/m-p/17662#M11630</guid>
      <dc:creator>TrinaDe</dc:creator>
      <dc:date>2021-07-15T15:11:23Z</dc:date>
    </item>
    <item>
      <title>Re: How can we join two pyspark dataframes side by side (without using join, equivalent to pd.concat() in pandas)? I am trying to join two extremely large dataframes where each is of the order of 50 million rows.</title>
      <link>https://community.databricks.com/t5/data-engineering/how-can-we-join-two-pyspark-dataframes-side-by-side-without/m-p/17663#M11631</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;The code in a more legible format:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0693f000007OroyAAC"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2471i2EFBEF92C6E9DA21/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693f000007OroyAAC" alt="0693f000007OroyAAC" /&gt;&lt;/span&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 15 Jul 2021 15:21:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-can-we-join-two-pyspark-dataframes-side-by-side-without/m-p/17663#M11631</guid>
      <dc:creator>TrinaDe</dc:creator>
      <dc:date>2021-07-15T15:21:19Z</dc:date>
    </item>
  </channel>
</rss>

