How do I know if the number of files is causing performance issues?

User16826992666 — Wed, 16 Jun 2021 16:42:52 GMT

I have read and heard that having too many small files can cause performance problems when reading large data sets. But how do I know if that is an issue I am facing?

Re: How do I know if the number of files is causing performance issues?

sajith_appukutt — Fri, 18 Jun 2021 20:47:00 GMT

The Databricks SQL endpoint has a Query History section that provides additional information for debugging and tuning queries. One of the metrics under the execution details is the number of files read.

For ETL and data science workloads, you can open the Spark UI of the cluster and click "Query Details" to see the number of files read there as well.
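
If the table happens to be a Delta table (an assumption, the question does not say), a rough complementary check is to compare the file count against the total size. The sketch below is illustrative only: `FileLayout`, `looks_like_small_files`, `delta_file_layout`, and the 32 MB / 1000-file thresholds are made-up names and values, not Databricks recommendations. It relies on `DESCRIBE DETAIL`, which for Delta tables returns `numFiles` and `sizeInBytes` columns.

```python
from dataclasses import dataclass

BYTES_PER_MB = 1024 * 1024


@dataclass
class FileLayout:
    """File count and total size of a table (hypothetical helper type)."""
    num_files: int
    size_in_bytes: int

    @property
    def avg_file_mb(self) -> float:
        # Guard against empty tables to avoid dividing by zero.
        if self.num_files == 0:
            return 0.0
        return self.size_in_bytes / self.num_files / BYTES_PER_MB


def looks_like_small_files(layout: FileLayout,
                           threshold_mb: float = 32.0,
                           min_files: int = 1000) -> bool:
    """Heuristic only: many files AND a small average file size.

    Both thresholds are arbitrary assumptions; tune them for your workload.
    """
    return layout.num_files >= min_files and layout.avg_file_mb < threshold_mb


def delta_file_layout(spark, table_name: str) -> FileLayout:
    """Read numFiles / sizeInBytes for a Delta table via DESCRIBE DETAIL.

    `spark` is an existing SparkSession (e.g. the one a Databricks notebook
    provides); this only works for Delta tables.
    """
    row = (spark.sql(f"DESCRIBE DETAIL {table_name}")
                .select("numFiles", "sizeInBytes")
                .first())
    return FileLayout(num_files=row["numFiles"],
                      size_in_bytes=row["sizeInBytes"])
```

In a notebook this could be used as `looks_like_small_files(delta_file_layout(spark, "my_db.my_table"))`, where the table name is a placeholder. A `True` result is only a hint; the number of files read per query in Query History or the Spark UI remains the more direct signal.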
