NandiniN
Databricks Employee

To optimize DataFrame operations:

  • Use cache() or persist() to cache intermediate DataFrames to avoid recomputation.
  • Use broadcast joins for small DataFrames and ensure join keys are properly partitioned.
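  • Minimize shuffles. Note that repartition() itself triggers a full shuffle, so call it deliberately (for example, once on the join key before several joins), and prefer coalesce() when you only need fewer partitions.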
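
As a rough sketch of the points above in PySpark: the function names, the `orders`/`countries` DataFrames, and the `country_code` column are all hypothetical, so adapt them to your own tables. It uses the DataFrame `hint("broadcast")` method, which is one way to request a broadcast join (the `broadcast()` function from `pyspark.sql.functions` is another).

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame


def enrich_orders(orders: "DataFrame", countries: "DataFrame") -> "DataFrame":
    """Join a large table to a small lookup table and cache the result.

    `orders`, `countries`, and `country_code` are hypothetical names.
    """
    # hint("broadcast") asks Spark to ship the small table to every executor,
    # so the large table does not need to be shuffled for the join.
    joined = orders.join(countries.hint("broadcast"), on="country_code", how="left")

    # cache() only marks the result for reuse; the first action materializes it,
    # and later actions read the cached data instead of recomputing the join.
    return joined.cache()


def shrink_partitions(df: "DataFrame", num_partitions: int) -> "DataFrame":
    """Reduce the partition count without a full shuffle."""
    # coalesce() merges existing partitions; repartition() would shuffle all rows.
    return df.coalesce(num_partitions)
```

Once the cached DataFrame is no longer needed, calling `unpersist()` on it frees the memory.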

I believe you only want to display a sample of the data. In that case, use limit(1000) or show(1000) to restrict the number of rows displayed. For large datasets, you could instead export them to external storage (e.g., DBFS, S3) and download them for analysis.
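
A minimal sketch of both options, assuming an existing DataFrame; the helper names and the output path are hypothetical, and Parquet is just one possible export format.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame


def preview(df: "DataFrame", n: int = 1000) -> "DataFrame":
    """Return a DataFrame holding at most n rows, suitable for display."""
    # limit() is a transformation, so nothing runs until the result is displayed.
    return df.limit(n)


def export_parquet(df: "DataFrame", path: str) -> None:
    """Write the full dataset to external storage instead of displaying it."""
    # mode("overwrite") replaces any existing output at the target path.
    df.write.mode("overwrite").parquet(path)
```

In a Databricks notebook you could then call `display(preview(df))`, or `preview(df).show(20)` to print a few rows, and `export_parquet(df, "dbfs:/tmp/exports/my_table")` with a path of your choosing.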