01-31-2025 10:52 PM
To optimize DataFrame operations:
- Use cache() or persist() to cache intermediate DataFrames and avoid recomputation.
- Use broadcast joins for small DataFrames and ensure join keys are properly partitioned.
- Minimize shuffles. Note that repartition() itself triggers a full shuffle, so call it deliberately rather than repeatedly.
I believe you only want to display the data in order to sample it. In that case, use limit(1000) or show(1000) to restrict the number of rows displayed. You could also export large datasets to external storage (e.g., DBFS, S3) and download them for analysis.