Does partition pruning / partition elimination not work for folder-partitioned JSON files? (Spark 3.1.2)

MartinB
Contributor III

Imagine the following setup:

I have log files stored as JSON files partitioned by year, month, day and hour in physical folders:

"""
/logs
|-- year=2020
|-- year=2021
`-- year=2022
    |-- month=01
    `-- month=02
        |-- day=01
        |-- day=...
        `-- day=13
            |-- hour=0000
            |-- hour=...
            `-- hour=0900
                |-- log000001.json
                |-- <many files>
                `-- log000133.json
""""

Spark supports partition discovery for folder structures like this ("All built-in file sources (including Text/CSV/JSON/ORC/Parquet) are able to discover and infer partitioning information automatically" https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery).
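Indeed, loading the root folder makes the partition columns show up in the inferred schema. A quick check (the commented output is a sketch; the remaining fields depend on the JSON payload, and the inferred types are what I'd expect for numeric folder values):

spark.read.format('json').load('/logs').printSchema()
# root
#  |-- ... fields inferred from the JSON documents ...
#  |-- year: integer (nullable = true)
#  |-- month: integer (nullable = true)
#  |-- day: integer (nullable = true)
#  |-- hour: integer (nullable = true)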

However, in contrast to Parquet files, I found that Spark does not use this metadata for partition pruning / partition elimination when reading JSON files.

In my use case, I am only interested in logs from a specific time window (see the filter below):

(spark
  .read
  .format('json')
  .load('/logs')
  .filter('year=2022 AND month=02 AND day=13 AND hour=0900')
)

I'd expect that Spark would be able to apply the filters on the partition columns "early" and only scan folders matching the filters (e.g. Spark would not need to scan folders and read files under '/logs/year=2020').

However, in practice the execution of my query takes a lot of time. It looks to me as if Spark first scans the whole filesystem starting at '/logs', reads all files, and only then applies the filters (on the already-read data). Due to the nested folder structure and the large number of folders/files, this is very expensive.

Apparently, Spark does not push down the filter (i.e., it does not apply partition pruning / partition elimination).
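One way to check what the scan is actually doing is to look at the physical plan: for file sources, the FileScan node reports a PartitionFilters entry when filters are pushed down to the partitioning layer. A minimal check, reusing the query from above:

df = (spark
  .read
  .format('json')
  .load('/logs')
  .filter('year=2022 AND month=02 AND day=13 AND hour=0900'))

df.explain(True)
# In the physical plan, look at the 'FileScan json' node and its
# 'PartitionFilters: [...]' entry; an empty list would mean no pruning
# happens at the partitioning layer.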

It strikes me as odd that the behavior for JSON files differs from that for Parquet.

Is this as-designed or a bug?

For now, I ended up implementing partition pruning myself in a pre-processing step: I use dbutils.fs.ls to scan the "right" folders iteratively and assemble an explicit file list that I then pass to the Spark read command (see the sketch below).
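For reference, here is a minimal sketch of that pre-processing step, assuming a Databricks notebook (where dbutils is available) and the folder layout shown at the top of this post; list_log_files and its defaults are hypothetical names chosen for illustration:

def list_log_files(base='/logs', year=2022, month=2, day=13, hour=900):
    # Build the partition path from the folder-naming convention shown above
    # ('year=YYYY/month=MM/day=DD/hour=HHMM').
    path = f'{base}/year={year}/month={month:02d}/day={day:02d}/hour={hour:04d}'
    # dbutils.fs.ls returns FileInfo objects; keep only the JSON files.
    return [f.path for f in dbutils.fs.ls(path) if f.path.endswith('.json')]

files = list_log_files()
# DataFrameReader.load also accepts a list of paths, so only these files are read.
df = spark.read.format('json').load(files)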

26 REPLIES

Hi @Kaniz Fatma​ ,

Yes, this is what I did - and I end up with the same error every time after logging in:

(screenshot: error after login)

Kaniz
Community Manager

Hi @Martin B.​ , Thank you for sharing this. This will be taken care of, for sure 😊. Let me get back to you soon.

Hi @Kaniz Fatma​ ,

Any updates on this one?

Kaniz
Community Manager

Hi @Martin B.​ , Did you try again?

Please try to share your feedback here.

Hi @Kaniz Fatma​ ,

Yes, I did. This time, the error I saw before is no longer displayed.

But following your link https://databricks.com/feedback, I end up on the landing page of my community workspace; I had expected a feedback portal.

In my workspace, under "Help" there is another "Feedback" button:

(screenshot: the "Feedback" button in the Help menu)

But this is just a mailto: link to the address feedback@databricks.com.

Is this the intended way to make feature requests?

Kaniz
Community Manager

Hi @Martin B.​ , The Ideas Portal lets you influence the Databricks product roadmap by providing feedback directly to the product team. Use the Ideas Portal to:

  • Enter feature requests.
  • View, comment, and vote up other users’ requests.
  • Monitor the progress of your favorite ideas as the Databricks product team goes through their product planning and development process.

Hi @Kaniz Fatma​ ,

When I try to access https://ideas.databricks.com/, the situation is just as I described a month ago: after the login, an error is displayed:

(screenshot: same error after login)

Last month, my understanding was that you were going to check why that is the case.

Are you positive that Databricks Community Edition (free) users are allowed to access the Ideas Portal?

Kaniz
Community Manager

Hi @Martin B.​ , You need a subscription to submit an idea.

Databricks Community Edition (free) users are not allowed to access the Ideas Portal.

Hi @Kaniz Fatma​ ,

That's unfortunate, but thanks for the answer.

Kaniz
Community Manager

Hi @Martin B.​ , Databricks Community Edition users can get more capacity and gain production-grade functionality by upgrading to the complete Databricks platform. To upgrade, sign up for a 14-day free trial or contact us.

The complete Databricks platform offers production-grade functionality, such as an unlimited number of clusters that quickly scale up or down, a job launcher, collaboration, advanced security controls, and expert support. It helps users process data at scale or build Apache Spark™ applications in a team setting.

MartinB
Contributor III

@Kaniz Fatma​ , could you maybe involve a Databricks expert?

Kaniz
Community Manager

Hi @Martin B.​ , Thank you for looping me into this conversation - I like the ideas I am seeing. I'll get back to you asap with an apt response after connecting with some of Databricks' experts on Spark.
