I set up a notebook that uses Auto Loader to ingest data from an S3 bucket containing over 500K CSV files into a Hive table.
Recently the number of rows (and input files) in the table grew from around 150M to 530M, and now each batch takes around an hour to complete, versus around 1-2 minutes before the growth. I have tried optimizing the table, enabling auto optimize, setting spark.sql.shuffle.partitions to 2000 on the cluster, and using high-performance nodes, but each batch still takes a very long time to complete.
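For context, this is roughly how the stream is set up (a minimal sketch of the configuration described above, not my exact code; the bucket path, schema location, and table name are placeholders):

```python
# Sketch of the Auto Loader ingestion described above.
# Paths and names below are hypothetical placeholders.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")                      # source files are CSV
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/")
    .load("s3://my-bucket/input/")                            # bucket with 500K+ CSV files
)

(
    df.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/")
    .trigger(availableNow=True)                               # process pending files, then stop
    .toTable("my_database.my_table")                          # target Hive table
)
```

The shuffle-partition setting mentioned above is applied at the cluster level via spark.sql.shuffle.partitions = 2000.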
Is there anything else I can try to improve the performance?
Thank you