This happens because of lazy evaluation.
When you assign DF2 = DF.dropDuplicates(['someColumn']), nothing is executed in the background. In fact, no data is loaded into DF2 until an action is performed on it.
Only when you execute an action such as display() or show() is the dropDuplicates transformation actually applied to DF and the result materialized as DF2. That's how lazy evaluation works.
Every time you invoke an action, a fresh transformation is performed in the background. And because there are two different rows with someColumn value '1', each recomputation may pick a different otherColumn value, effectively at random.
This lazy-evaluation approach is what lets Spark filter out or skip unnecessary transformations when an action is finally performed. In your case, each of the two actions below triggers its own recomputation of DF2:
from pyspark.sql.functions import col  # needed for col()
display(DF2.select(col('otherColumn')))  # action 1: recomputes DF2
display(DF2)                             # action 2: recomputes DF2 again
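For illustration, here is a minimal, self-contained sketch that reproduces the situation; the sample data and values are assumptions, not your actual dataset:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two rows share someColumn = 1 but carry different otherColumn values,
# so dropDuplicates(['someColumn']) has to keep one of them arbitrarily.
DF = spark.createDataFrame([(1, 'a'), (1, 'b'), (2, 'c')], ['someColumn', 'otherColumn'])
DF2 = DF.dropDuplicates(['someColumn'])  # transformation only, nothing runs yet

# Each action re-runs the deduplication, so on a multi-partition cluster the
# row kept for someColumn = 1 can differ between these two calls.
DF2.show()
DF2.select('otherColumn').show()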
How to avoid it?
Well, if you always want a deterministic value for otherColumn, you should bring in another column and use it to decide which row to keep. A very common solution is a timestamp, such as the insert or update timestamp associated with the record, keeping the most recently updated row per key as in the sketch below.
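Here is a minimal sketch of that approach using a window function; the timestamp column name updated_at is an assumption for illustration:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each someColumn group, newest update first (updated_at is
# a hypothetical column name), then keep only the top-ranked row per group.
w = Window.partitionBy('someColumn').orderBy(F.col('updated_at').desc())
DF2 = (DF.withColumn('rn', F.row_number().over(w))
         .filter(F.col('rn') == 1)
         .drop('rn'))

Unlike dropDuplicates, this keeps the same row for a given key on every recomputation, as long as the timestamps within a key are distinct.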
However, if you don't have such a column and simply want the same values to be reused everywhere DF2 appears later in the code, use the persist() operation. It keeps the contents of the DataFrame across operations after the first time it is computed.
from pyspark import StorageLevel
# Cache the deduplicated result so later actions reuse the same rows
DF2 = DF.dropDuplicates(['someColumn']).persist(StorageLevel.MEMORY_AND_DISK)
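Once the first action has materialized DF2, subsequent actions read from the cache instead of re-running dropDuplicates, so the chosen otherColumn values stay stable. A typical usage pattern looks like this (a sketch, not your exact code):

DF2.count()      # first action: computes DF2 once and caches it
display(DF2)     # served from the cache, same rows as above
DF2.unpersist()  # release the cached data once DF2 is no longer needed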