02-14-2022 11:19 AM
Hi @Aman Sehgal ,
thanks for your advice.
Unfortunately I have no influence on the partitioning of the data, I'm just a consumer 😣
Anyhow, I'd like to know why you think Spark would be able to apply partition elimination if there were just one partitioning level.
Imagine there were 3 years of data: that would mean 3*365*24 = 26,280 folders under /logs. As far as I can tell, Spark would still discover all those directories and load every JSON file it finds into memory before applying the filter.
Or are you suggesting determining the right folder manually and then loading from the correct folder?
This would be "manual" partition elimination, in my opinion.
(spark
    .read
    .format('json')
    .load('/logs/2022_02_13_0900')
)
I also tried using the col function in the filter. Unfortunately, it had no performance impact compared to specifying the filter as a "SQL condition string". 😟
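Coming back to the "manual" partition elimination idea, here is a rough sketch of what I have in mind. It assumes there is one folder per hour and that the folder names follow the yyyy_MM_dd_HHmm pattern from my example above; hourly_paths is just a helper name I made up, not anything Spark provides.

```python
from datetime import datetime, timedelta


def hourly_paths(start, end, root="/logs"):
    """Return one folder path per hour in the half-open range [start, end).

    Assumes hourly folders named like /logs/2022_02_13_0900.
    """
    paths = []
    # Snap to the top of the hour so the names match the folder pattern.
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        paths.append(f"{root}/{current:%Y_%m_%d_%H%M}")
        current += timedelta(hours=1)
    return paths


paths = hourly_paths(datetime(2022, 2, 13, 9), datetime(2022, 2, 13, 12))

# With an active SparkSession named `spark`, only these folders would be
# listed and read, since load() accepts a list of paths:
# df = spark.read.format("json").load(paths)
```

As far as I know, Spark raises an error if one of the listed paths does not exist, so missing hours would probably have to be filtered out before calling load.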