How to optimize conversion between PySpark and Arrow?

It seems you can convert between Spark DataFrames and Arrow objects by using pandas as an intermediary, but this has limitations: it collects all records in the DataFrame to the driver, so it should only be done on a small subset of the data. On anything larger, you hit type conversion warnings and run out of memory.
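For reference, the pandas-intermediary route described above can be sketched roughly as follows. This is a minimal sketch, not a recommended solution: the helper name `spark_to_arrow_via_pandas` and the `max_rows` cap are my own illustrative assumptions, and it assumes `pyspark`, `pandas`, and `pyarrow` are installed and that `spark` is an active `SparkSession`.

```python
def spark_to_arrow_via_pandas(spark, df, max_rows=10_000):
    """Convert a (small) Spark DataFrame to a pyarrow.Table via pandas.

    Hypothetical helper for illustration. Everything is collected to the
    driver, so `max_rows` caps how much data is pulled back.
    """
    # Imported lazily so that defining the helper does not require pyarrow.
    import pyarrow as pa

    # Let Spark use Arrow for the Spark -> pandas transfer (Spark 3.x config key).
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    # limit() keeps the collected subset small; toPandas() still materializes
    # every remaining row in driver memory.
    pdf = df.limit(max_rows).toPandas()

    # Build the Arrow table from the pandas DataFrame, dropping the pandas index.
    return pa.Table.from_pandas(pdf, preserve_index=False)
```

Calling it as `spark_to_arrow_via_pandas(spark, df, max_rows=1000)` would return a `pyarrow.Table` built on the driver, which is exactly where the memory limitation comes from.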

What's a more efficient, optimized way to convert a PySpark DataFrame to Arrow?