Hi @Suheb,
You can do almost anything with data using PySpark. Depending on what you want to achieve:
- do you want to remove duplicates?
df = df.dropDuplicates()
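If you only want to deduplicate on specific key columns, dropDuplicates also accepts a subset (the "id" column here is just an example):
df = df.dropDuplicates(["id"])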
- do you want to change types?
df = df.withColumn("sales", col("sales").cast("double"))
df = df.withColumn("date", col("date").cast("date"))
- do you want to handle missing data?
df = df.fillna({'city': 'Unknown', 'sales': 0})
# or drop rows with nulls in key columns
df = df.na.drop(subset=['id', 'sales'])
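If a fixed default isn't appropriate, you can also fill with a computed value, e.g. the column mean (a quick sketch, assuming "sales" is numeric):
from pyspark.sql.functions import mean
mean_sales = df.select(mean("sales")).first()[0]
df = df.fillna({"sales": mean_sales})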
- or maybe you want to apply some transformations?
from pyspark.sql.functions import col, trim, upper
df = df.withColumn("country", upper(trim(col("country"))))
And many more. You can do virtually anything with data using PySpark 🙂