What is the most efficient way to read in a partitioned parquet file with pyspark?
06-24-2021 08:09 AM
I work with Parquet files stored in AWS S3 buckets. They are multiple TB in size and partitioned by a numeric column containing integer values between 1 and 200; call it my_partition. I read this data and run compute actions on it in Databricks with autoscaling turned off.
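For reference, here is a minimal sketch of the kind of read I have in mind (the S3 path and filter bounds are placeholders, not my real values); my assumption is that filtering on my_partition should let Spark prune partition directories rather than scan the whole dataset:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder bucket and dataset path, not my actual location.
df = spark.read.parquet("s3://my-bucket/my-dataset/")

# Filtering on the partition column should let Spark prune partition
# directories instead of scanning the full multi-TB dataset.
subset = df.filter(df.my_partition.between(10, 20))

subset.count()  # example action that triggers the read
```

Is this the right pattern, or is there a more efficient approach for reads at this scale?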
Labels:
- AWS
- Efficient Way
- Parquet File