daniel_sahal
Databricks MVP

@Michael Popp​ 

In my opinion, the best approach is to split the file into partitions (you need to find the best-fit column for that), ingest them using Auto Loader with trigger=AvailableNow (batch-style processing), and write to the target using the same partitioning as the split files.

That way you get parallelism and avoid data skew.
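
A rough sketch of what this could look like in PySpark, in case it helps. The helper names, paths, table name, and partition column are all placeholders I made up, and I am assuming the partition column exists in the data and that the schema location can share the checkpoint directory. Treat it as a starting point, not a tested pipeline.

```python
def autoloader_options(file_format, schema_location):
    """Reader options for the Auto Loader (`cloudFiles`) source.

    Hypothetical helper: the option keys are Auto Loader options,
    the function itself is only for illustration.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def ingest_available_now(spark, source_dir, target_table, checkpoint_dir,
                         partition_col, file_format="csv"):
    """Ingest every file currently in source_dir, then stop.

    Assumes a Databricks runtime where `spark` is the active session
    and partition_col is a column present in the ingested data.
    """
    stream = (
        spark.readStream
        .format("cloudFiles")
        .options(**autoloader_options(file_format, checkpoint_dir))
        .load(source_dir)
    )
    query = (
        stream.writeStream
        .option("checkpointLocation", checkpoint_dir)
        .partitionBy(partition_col)   # same column the file was split on
        .trigger(availableNow=True)   # process what is available, then finish
        .toTable(target_table)
    )
    query.awaitTermination()
    return query
```

On a cluster you would call it with your own values, for example `ingest_available_now(spark, "/mnt/landing/split/", "bronze.events", "/mnt/checkpoints/events/", "event_date")`. All four of those values are invented for the example.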
