Hi @Brad,
Could you try applying some standard optimization practices and check the outcome:
1. If your runtime is 15.2 or above, could you enable liquid clustering on the source and target tables using the JOIN columns?
ALTER TABLE <table_name>
CLUSTER BY (<clustering_columns>)
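As a minimal sketch (the table and column names below are placeholders, not taken from your job), the clustering setup could look like this; note that clustering applies incrementally, so running OPTIMIZE afterwards clusters the existing data:

```sql
-- Enable liquid clustering on the join key (placeholder names).
ALTER TABLE source_table
CLUSTER BY (join_key);

ALTER TABLE target_table
CLUSTER BY (join_key);

-- Trigger clustering of the data already in the tables.
OPTIMIZE source_table;
OPTIMIZE target_table;
```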
2. If your runtime is below 15.2, could you run OPTIMIZE on the source and target tables with ZORDER BY on the JOIN columns?
OPTIMIZE <table_name>
ZORDER BY (<column1>, <column2>)
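For example, assuming a hypothetical join key and date column (not taken from your pipeline), the compaction step would be:

```sql
-- Compact small files and co-locate rows on the join columns
-- (table and column names are illustrative placeholders).
OPTIMIZE source_table
ZORDER BY (join_key, event_date);

OPTIMIZE target_table
ZORDER BY (join_key, event_date);
```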
What I believe is happening: the problem is in the source table. It is much bigger than the target, unpartitioned, unoptimized, and made up of many small files. Even with the checkpointing mechanism, all those files need to be scanned to select the incremental data. If your source is 100 GB and the average file is 100 KB, that is far smaller than the recommended ~100 MB per file.
At that file size you have roughly one million files, and all the metadata for them (file paths, sizes, locations) is held in memory on the driver node; the driver has to track every one of those files.
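You can check whether this is actually the case with DESCRIBE DETAIL, which reports the file count and total size for a Delta table (table name is a placeholder):

```sql
-- numFiles and sizeInBytes show how fragmented the table is;
-- sizeInBytes / numFiles gives the average file size.
DESCRIBE DETAIL source_table;
```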
In summary, based on the symptoms: if you cannot use liquid clustering, I would first run OPTIMIZE with ZORDER on both the source and target tables and check whether that solves the issue. If it does, your problem is caused by the number of small files.