Right, that's exactly what I'm trying to do, but I have no idea how to do it!
I can chunk the spark df with the following:
def df_in_chunks(df, row_count):
    """in: df  out: [df1, df2, ..., df100]"""
    count = df.count()

    if count > row_count:
        num_chunks = count // row_count
        chunk_percent = 1 / num_chunks  # 1% would become 0.01
        return df.randomSplit([chunk_percent] * num_chunks, seed=1234)
    return [df]
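For reference, here's a minimal sketch of how I call it (the toy DataFrame, the 1,000-row chunk size, and the variable names are just placeholders, not my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy DataFrame standing in for my real data
spark_df = spark.range(0, 5000).withColumnRenamed("id", "feature")

# split into chunks of roughly 1,000 rows each
dfs = df_in_chunks(spark_df, 1000)
print(len(dfs))  # about 5 chunks (randomSplit sizes are approximate)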
So now I have a list of Spark DataFrames, but if I do "for df in dfs: df_pd = df.toPandas(); model.predict(df_pd)" it runs serially, not in parallel. Do you have any suggestion on how to make it parallel?
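To be concrete, here is that loop spelled out (model is just a placeholder for my already-fitted scikit-learn-style model, and dfs is the list returned by df_in_chunks above):

# current approach: each chunk is pulled to the driver and scored one at a time
predictions = []
for chunk in dfs:
    chunk_pd = chunk.toPandas()                   # collects this chunk onto the driver
    predictions.append(model.predict(chunk_pd))   # blocks until this chunk is scored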
Thank you so much!