Comparing 2 dataframes and create columns from val...

lmcglone · ‎01-11-2023

Hi,

I have a dataframe that has name and company

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

columns = ["company","name"]

data = [("company1", "Jon"), ("company2", "Steve"), ("company1", "Kim"), ("company3", "Sam"), ("company4", "Jim"), ("company4", "Tony"), ("company5", "Stan"),

]

df = spark.createDataFrame(data=data, schema = columns).show()

Then I have another dataframe that has the company names

columns2 = ["job_comany","num"]

data2 = [("company1",1), ("company2",2), ("company3",3), ("company4",4), ("company5",5),]

df2 = spark.createDataFrame(data=data2, schema = columns2).show()

What I would like to do is use the company names dataframe to search the dataframe with the person names and identify the companies associated with the people and create a dataframe with the company names as columns with a 0 or 1 with if the person is with that company. Here is a picture of what I would like to see as my final dataframe.

Hubert-Dudek · ‎01-11-2023

You need to join and pivot

df
.join(df2, on=[df.company == df2.job_company]))
.groupBy("company", "name")
.pivot("job_company")
.count()

My blog: https://databrickster.medium.com/

lmcglone · ‎01-11-2023

Thanks....that is perfect. 😀

Another question to take it a step further. From this code how would I alter the name of the produced column name. In your example there is company1, company2, etc. Is it possible to change those names to company1_a, company2_a, etc.?

Comparing 2 dataframes and create columns from values within a dataframe