02-13-2022 02:16 PM
Instead of nested directories, could you try a single-level partition and name your partitions `year_month_day_hour` (assuming your JSON files live only in the hour directories)? That way Spark knows in one shot which partition it has to look at.
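As a rough sketch of what such flat partition names could look like, the helpers below build one from a timestamp. The function names, the zero-padded format, and the `log_hour` column name are assumptions for illustration only, not something from your setup.

```python
from datetime import datetime


def flat_partition_name(ts: datetime) -> str:
    """Build a single-level partition value in year_month_day_hour form."""
    return ts.strftime("%Y_%m_%d_%H")


def flat_partition_path(root: str, ts: datetime, column: str = "log_hour") -> str:
    """Return a Hive-style partition directory under root.

    `column` is a hypothetical partition column name; pick whatever fits your data.
    """
    return f"{root.rstrip('/')}/{column}={flat_partition_name(ts)}"


# Example: the directory that would hold logs for 13 Feb 2022, 09:00.
print(flat_partition_path("/logs", datetime(2022, 2, 13, 9)))
```

With a layout like this, a filter on the single partition column replaces the four separate filters on year, month, day and hour.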
Querying could also be expensive if your JSON files are very small (probably in KBs), since every file adds listing and open overhead.
Maybe check the file sizes; instead of having log files per hour, you might be better off partitioning them per day.
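One way to check the file sizes is sketched below. It assumes the logs are reachable as a local or mounted path; for cloud storage you would need that storage's own listing API instead. The function names and the 1 MiB "small file" threshold are arbitrary choices for illustration.

```python
import os


def collect_file_sizes(root, suffix=".json"):
    """Walk root recursively and return the size in bytes of each matching file."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                sizes.append(os.path.getsize(os.path.join(dirpath, name)))
    return sizes


def summarize_sizes(sizes, small_threshold=1024 * 1024):
    """Count the files, how many fall below the threshold, and the mean size."""
    if not sizes:
        return {"count": 0, "small": 0, "mean_bytes": 0.0}
    small = sum(1 for s in sizes if s < small_threshold)
    return {"count": len(sizes), "small": small, "mean_bytes": sum(sizes) / len(sizes)}


# Usage (path is an assumption): summarize_sizes(collect_file_sizes("/logs"))
```

If most files turn out to be below the threshold, that would support moving to per-day partitions.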
Last, maybe try querying using the `col` function. Not sure if it'll help, but it's worth a try.
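from pyspark.sql.functions import col

# Filter on the partition columns so Spark can skip the other directories.
# Adjust the literals to match how the values appear in your directory names.
df = (
    spark.read
    .format('json')
    .load('/logs')
    .filter((col('year') == 2022) & (col('month') == 2) & (col('day') == 13) & (col('hour') == '0900'))
)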