Re: Optimizing Spark Read Performance on Delta Tab...

minhhung0507 · 04-04-2025

Thank you very much for your detailed and easy-to-understand explanation—it was incredibly helpful in addressing the issue. Your guidance has been a major asset in my troubleshooting process.

However, I have one further question that I hope you can shed some light on, along with some more context. I’m currently using the Play Framework with Spark 3.3.2 and Delta 2.3 to build an API that reads data directly from Google Cloud Storage. I’ve compared the performance across three different scenarios:

1. **Spark on Databricks Runtime (v16):** Using clustering on the same source, the performance is excellent—approximately 7 seconds.

2. **Google BigLake:** Reading from the same source also yields good performance, around 6-7 seconds.

3. **Self-hosted Spark on Play Server (Spark 3.3.2, Delta 2.3):** The performance is extremely slow—around 2 minutes.

All three methods share the same network topology and read from the same data source. Given these conditions, why is there such a large discrepancy in performance between the self-hosted Spark setup and the other two environments?

Do you have any suggestions or insights into why this might be happening? This issue is really proving to be a challenging puzzle!

Thanks again for your help.

Regards,
Hung Nguyen