Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to merge two data frames column-wise in Apache Spark

New Contributor

I have the following two data frames, each with a single column and exactly the same number of rows. How do I merge them so that I get a new data frame containing the two columns and all rows from both data frames? For example,


| ColA|
|    1|
|    2|
|    3|
|    4|


| ColB|
|    5|
|    6|
|    7|
|    8|

I want the result of the merge to be


| ColA| ColB|
|    1|    5|
|    2|    6|
|    3|    7|
|    4|    8|

I don't quite see how I can do this with the join method, because there is only one column and joining without any condition would create a cartesian product of the two columns. Is there a direct Spark DataFrame API call to do this? In R, there is a merge function for combining two data frames, but I don't know whether it behaves like join.




New Contributor II

Hi all, I am working on the same problem. Any findings? Thanks!

New Contributor II

@bhosskie and @Govind89

I think what is being asked is to merge all columns. One way is to add a monotonically_increasing_id() column to each DataFrame (this works only if the dataframes have exactly the same number of rows) and then join on the ids. The number of columns in each dataframe can be different.

from pyspark.sql.functions import monotonically_increasing_id

df1 = sqlContext.createDataFrame(
    [("foo", "bar", "too", "aaa"), ("bar", "bar", "aaa", "foo"), ("aaa", "bbb", "ccc", "ddd")],
    ("k", "K", "v", "V"))
df2 = sqlContext.createDataFrame(
    [("aaa", "bbb", "ddd"), ("www", "eee", "rrr"), ("jjj", "rrr", "www")],
    ("m", "M", "n"))

# Tag each DataFrame with a synthetic id column, then join on it
df1 = df1.withColumn("id", monotonically_increasing_id())
df2 = df2.withColumn("id", monotonically_increasing_id())

df3 = df2.join(df1, "id", "outer").drop("id")

This gives the columns merged, although the row order is reversed in the displayed output.

Thanks! This seems to work for me, although I was nervous about using monotonically_increasing_id() because the ids could differ for corresponding rows in the two dataframes if any parts of them were generated in different partitions. But I guess if df2 is generated from df1, the chances of that happening in practice are fairly low? Anyway, I use a "left_outer" join to make sure I keep the rows of the dataframe I consider most important.

New Contributor II

I have the same problem. Is there any solution that doesn't use a join?






New Contributor II

I have the same problem

New Contributor II


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()
sc = spark.sparkContext

# Two single-row aggregates over the same table
sqlDF1 = spark.sql("select count(*) as Total FROM user_summary")
sqlDF2 = spark.sql("select count(*) as Total_New FROM user_summary")


