Hello @Jonathan_
Good day!
To answer your first question: as you mentioned, pandas can be faster than Spark here because Spark is designed for big data and pays a distributed-execution overhead (job scheduling, task serialization, shuffles). Pandas runs in-memory on a single machine without that overhead, which makes it faster on small datasets.
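As a rough illustration, if a table is small enough to fit on the driver, you can pull it into pandas and work there. A minimal sketch, assuming a Databricks notebook where `spark` is predefined; the table and column names are hypothetical:

```python
# Hypothetical small dimension table; toPandas() collects everything to the
# driver, so only do this when the data comfortably fits in driver memory.
small_pdf = spark.table("dim_country").toPandas()

# Plain pandas aggregation: no shuffles, no task-scheduling overhead
totals = small_pdf.groupby("region")["population"].sum()
print(totals)
```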
As you mentioned, you have joins and aggregations.
Make sure you have AQE (Adaptive Query Execution) turned on, use a broadcast join when one side is small, and check whether you can convert string values such as dates into ints, so that sort-based joins and aggregations are avoided and cheaper hash-based ones are used instead.
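A minimal sketch of those three knobs, again assuming `spark` is predefined in the notebook; the `facts`/`dims` tables and the `join_key` column are hypothetical:

```python
from pyspark.sql import functions as F

# Enable Adaptive Query Execution (already on by default in recent runtimes)
spark.conf.set("spark.sql.adaptive.enabled", "true")

facts = spark.table("facts")  # hypothetical tables
dims = spark.table("dims")

# Cast string join keys to int where the values are numeric, so Spark can
# use hash-based operators instead of sort-based ones
facts = facts.withColumn("join_key", F.col("join_key").cast("int"))
dims = dims.withColumn("join_key", F.col("join_key").cast("int"))

# Broadcast the small side so the join avoids shuffling the large side
joined = facts.join(F.broadcast(dims), "join_key")
```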
See which join strategy is better for your workload: shuffled hash join or sort-merge join. Here is an article for you:
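As an illustration, Spark exposes both a global preference and per-join hints. A sketch under the same assumptions as above (hypothetical table names); benchmark both strategies on your own data before settling on one:

```python
facts = spark.table("facts")  # hypothetical tables
dims = spark.table("dims")

# Global preference: allow shuffled hash join instead of sort-merge join
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

# Per-join steering with a hint ("merge" would request sort-merge instead)
joined = facts.hint("shuffle_hash").join(dims, "join_key")
```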
The default block size is usually 128 MB; see if you can use repartition() or coalesce() to decrease the number of partitions, so that many small partitions are combined and the optimizer has better-shaped inputs to work with.
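For instance, a sketch where `df`, `join_key`, and the target of 64 partitions are hypothetical and should be tuned to your data volume:

```python
df = spark.table("facts")  # hypothetical table

# coalesce() merges partitions without a shuffle: cheap, good for shrinking
df = df.coalesce(64)

# repartition() does a full shuffle, but can co-locate rows by join key,
# which can make the subsequent join cheaper
# df = df.repartition(64, "join_key")
```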
Try to minimize shuffle operations and apply filters in the early stages of the pipeline so that less data flows into the joins and aggregations.
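For example, filtering and projecting before the join (same hypothetical tables and columns as above):

```python
from pyspark.sql import functions as F

# Filter and project before joining so less data is shuffled across the cluster
filtered = (
    spark.table("facts")
    .filter(F.col("event_date") >= "2024-01-01")  # drop unneeded rows early
    .select("join_key", "amount")                 # drop unneeded columns early
)
joined = filtered.join(spark.table("dims"), "join_key")
```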
Turn on spark.eventLog.enabled=true, collect the logs, and raise a ticket with Azure Databricks support, as they can check internally what the issue is, what actually went wrong, and the nature of the problem.
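Note that event-log settings are read at application startup, so on Databricks they usually belong in the cluster's Spark config rather than a notebook cell. A sketch for a self-managed session; the log directory path is hypothetical:

```python
from pyspark.sql import SparkSession

# These properties must be set before the session starts; on Databricks,
# put the equivalent key/value pairs in the cluster's Spark config instead.
spark = (
    SparkSession.builder
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "dbfs:/cluster-logs/eventlog")  # hypothetical path
    .getOrCreate()
)
```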
If you are working with Delta files, running OPTIMIZE can compact all the small files into larger ones.
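For example (the table name is hypothetical, and ZORDER is optional; it only helps if you frequently filter or join on that column):

```python
# Compact small Delta files into larger ones
spark.sql("OPTIMIZE my_table")

# Optionally co-locate related rows to speed up filters/joins on join_key
# spark.sql("OPTIMIZE my_table ZORDER BY (join_key)")
```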
I am also open to other solutions from contributors.