The pyspark.mllib library is built for RDDs, and the pyspark.ml library is built for DataFrames. The RDD-based mllib library is currently in maintenance mode, while the DataFrame-based ml library continues to receive updates and active development. For that reason, and because DataFrames are more widely used and generally recommended, you will usually want to use the pyspark.ml library.
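
To make the difference concrete, here is a minimal sketch of what the DataFrame-based workflow in pyspark.ml can look like. The function name `train_example`, the toy data, and the hyperparameter values are made up for illustration, and it assumes a local Spark installation is available. The comments note the rough RDD-based counterparts in pyspark.mllib.

```python
def train_example():
    """Fit a small logistic regression with the DataFrame-based pyspark.ml API.

    Imports are kept inside the function so the sketch can be pasted into a
    larger script without requiring Spark at import time.
    """
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    # Local session for illustration only; a real job would use the cluster's session.
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("ml-vs-mllib")
        .getOrCreate()
    )

    # pyspark.ml estimators read named DataFrame columns ("label", "features").
    # The pyspark.mllib counterpart would instead take an RDD of
    # pyspark.mllib.regression.LabeledPoint objects.
    df = spark.createDataFrame(
        [
            (1.0, Vectors.dense([0.0, 1.1])),
            (0.0, Vectors.dense([2.0, 1.0])),
            (1.0, Vectors.dense([0.1, 1.2])),
            (0.0, Vectors.dense([2.2, 0.9])),
        ],
        ["label", "features"],
    )

    # Illustrative hyperparameters, not tuned values.
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(df)  # returns a fitted model (a Transformer)

    # transform() appends prediction columns to the DataFrame.
    rows = model.transform(df).select("label", "prediction").collect()

    spark.stop()
    return rows
```

Calling `train_example()` returns a list of `Row` objects holding each label and its prediction. Because the estimator works on DataFrame columns, it can also be chained with feature transformers in a `pyspark.ml.Pipeline`, which is one practical reason to prefer the DataFrame-based API.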