pradeep_singh
Contributor III

 

Polars and pandas run only on the driver node, not on the worker nodes, so you won’t get the benefits of Databricks/Spark parallelism. If your data is small enough to fit in memory on a single driver node, you can continue to use them. If you don’t want to do any refactoring, you might choose a larger driver node (a single-node cluster).

If the data is large and you want to benefit from Databricks/Spark parallelism, consider using the pandas API on Spark (the pyspark.pandas module) instead of plain pandas or Polars. Check whether the methods you use in your current codebase have equivalents in the pandas API on Spark. Here is the documentation: https://docs.databricks.com/aws/en/pandas/pandas-on-spark
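To get a rough first pass at that check, something like the sketch below might help. It scans source code for method-style calls and reports the names that a target class does not define. The helper names (called_method_names, missing_on, check_against_pandas_on_spark) are made up for illustration and are not part of any library. It is deliberately crude: it flags every attribute call, including ones that are not DataFrame methods (for example pd.read_csv), so treat the output as a list of things to look up in the docs, not a definitive answer.

```python
import ast


def called_method_names(source):
    """Collect names used as attribute calls, e.g. `df.merge(...)` -> "merge"."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            names.add(node.func.attr)
    return names


def missing_on(target_cls, source):
    """Return the called names that target_cls does not define, sorted."""
    return sorted(
        name for name in called_method_names(source)
        if not hasattr(target_cls, name)
    )


def check_against_pandas_on_spark(source):
    """Hypothetical convenience wrapper: compare against pyspark.pandas.DataFrame.

    Assumes pyspark 3.2 or later is installed (the import is deferred so the
    helpers above can be used without Spark).
    """
    import pyspark.pandas as ps
    return missing_on(ps.DataFrame, source)
```

On a cluster you could read one of your modules as text and pass it to check_against_pandas_on_spark; anything it returns is worth checking by hand, since a name existing on the class does not guarantee identical behavior or parameters.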

Thank you,
Pradeep Singh - https://www.linkedin.com/in/dbxdev