I have two scenarios with different outcomes:
Scenario 1:
from pyspark.sql.functions import *
# create sample dataframes
df1 = spark.createDataFrame([(1, 2, 3), (2, 3, 4)], ["a", "b", "c"])
df2 = spark.createDataFrame([(1, 5, 6, 7), (2, 8, 9, 10)], ["a", "d", "e", "f"])
df3 = spark.createDataFrame([(1, 11, 12), (2, 13, 14)], ["a", "g", "h"])
# join the dataframes and select columns
base = df1.join(df2, "a", "left").join(df3, "a", "left").select("a", "b", "e", "g")
# apply filter
final = base.filter(col("c") == 1)
Scenario 2:
from pyspark.sql.functions import *
# create sample dataframes
df1 = spark.createDataFrame([(1, 2, 3), (2, 3, 4)], ["a", "b", "c"])
df2 = spark.createDataFrame([(1, 5, 6, 7), (2, 8, 9, 10)], ["a", "d", "e", "f"])
df3 = spark.createDataFrame([(1, 11, 12), (2, 13, 14)], ["a", "g", "h"])
df4 = spark.createDataFrame([(1,), (2,)], ["a"])
df5 = spark.createDataFrame([(1, "x"), (2, "y")], ["a", "z"])
# join the dataframes
base = df1.join(df2, "a", "left").join(df3, "a", "left").select("a", "b", "e", "g")
base_join = base.join(df4, "a").join(df5, "a", "left").select("a", "b")
# apply filter
final = base_join.filter(col("c") == 1)
In the first scenario, filter(col("c") == 1) is applied successfully and the dataframe final is created. In the second scenario, the same filter fails with the error "Column 'c' does not exist" — even though neither base_join in scenario 2 nor base in scenario 1 actually contains this column.
What's the difference between the two, and how does it show up in the physical plan?