06-13-2023 07:38 AM
@Vikas Goel :
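Adding a few more pointers here.
Spark's DataFrame writer has no minRecordsPerFile option. The only related option is maxRecordsPerFile, which caps the number of records per file, and df.schema has no jsonSize() method. If you want the Parquet files to reach a minimum size or record count, reduce the number of output partitions before writing, because each partition is written as at least one file. Here's how you can modify your code:
# Target size per Parquet file (256 MB) and the desired minimum number of records per file
target_file_bytes = 256 * 1024 * 1024
min_records_per_file = 1000
# Very rough per-record size guess: length of the schema JSON plus 100 bytes (a heuristic, not a measurement)
approx_bytes_per_record = len(df.schema.json()) + 100
# Records that fit in one target-sized file, never below the desired minimum
records_per_file = max(min_records_per_file, target_file_bytes // approx_bytes_per_record)
# Number of output files, rounded down so each file holds roughly records_per_file rows or more (df.count() scans the data once)
num_files = max(1, df.count() // records_per_file)
# repartition() spreads rows roughly evenly, so each partition becomes one file of about the target size
df.repartition(num_files).write.parquet("s3://your_bucket/parquet_output_path")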
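To sanity-check the sizing arithmetic without a Spark session, here is a minimal sketch in plain Python. The helper name plan_output_files and the sample figures are hypothetical, and the per-record byte size is assumed to be an estimate you supply (for example, measured from a sample of the data).

```python
def plan_output_files(total_records, approx_bytes_per_record,
                      target_file_bytes=256 * 1024 * 1024,
                      min_records_per_file=1000):
    """Return (records_per_file, num_files) for a target file size.

    Hypothetical helper: all inputs are estimates supplied by the caller.
    """
    if approx_bytes_per_record <= 0:
        raise ValueError("approx_bytes_per_record must be positive")
    # Records that fit in one target-sized file, never below the minimum
    records_per_file = max(min_records_per_file,
                           target_file_bytes // approx_bytes_per_record)
    # Round down so every file holds at least records_per_file rows on average;
    # always write at least one file
    num_files = max(1, total_records // records_per_file)
    return records_per_file, num_files


# Example figures (assumed): 10 million rows at roughly 1 KiB each
records_per_file, num_files = plan_output_files(10_000_000, 1024)
print(records_per_file, num_files)
```

The resulting num_files can be passed to df.repartition(num_files) before the write. Rounding down favors fewer, larger files, so files tend to land at or above the target rather than below it.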