03-31-2026 06:57 PM - edited 03-31-2026 07:00 PM
Polars and pandas run only on the driver node, not on the worker nodes, so you won’t get the benefits of Databricks/Spark parallelism. If your data is small enough to fit in memory on a single driver node, you can continue to use them. If you don’t want to do any refactoring, you can choose a larger driver node (for example, a single-node cluster).
If the data is large and you want to benefit from Databricks/Spark parallelism, consider using the pandas API on Spark (pyspark.pandas) instead of plain pandas or Polars. Before migrating, check whether the methods your current codebase uses have equivalents in the pandas API on Spark, since not every pandas method is implemented there. Here is the documentation: https://docs.databricks.com/aws/en/pandas/pandas-on-spark
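As a rough first pass at that check, here is a minimal sketch that lists which method names are not attributes of a target class. The helper names (find_missing, report_for_pandas_on_spark) and the sample method list are my own placeholders, not part of any library, and the name check is only a heuristic: a method that exists may still not support every parameter plain pandas does, so please confirm against the documentation above.

```python
def find_missing(method_names, target_cls):
    """Return the sorted names from method_names that target_cls does not expose.

    This only checks that an attribute with that name exists on the class;
    it does not check parameters or behavior.
    """
    return sorted(name for name in method_names if not hasattr(target_cls, name))


def report_for_pandas_on_spark(method_names):
    """Hypothetical helper: run the check against pyspark.pandas.DataFrame.

    Assumes pyspark is installed (it is on Databricks clusters). The import
    is done here so find_missing stays usable without pyspark.
    """
    import pyspark.pandas as ps

    return find_missing(method_names, ps.DataFrame)


# Example list of DataFrame methods a codebase might use (an assumption,
# replace it with the methods your own code actually calls).
USED_METHODS = ["groupby", "merge", "pivot_table", "to_xarray"]
```

On a cluster you could call report_for_pandas_on_spark(USED_METHODS) in a notebook cell and review whatever names come back before deciding how much refactoring is needed.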
Pradeep Singh - https://www.linkedin.com/in/dbxdev