Thank you for your suggestion.
Unfortunately, we do not have a unique incremental ID. Each record is identified by its tag_id and a timestamp, with one record per tag per minute.
We initially considered using spark.readStream to load the historical data month by month during low-usage periods (e.g. weekends), but we are not sure whether switching the same stream to continuous ingestion afterwards would be compatible with its checkpointing and state tracking.
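
To illustrate the month-by-month idea, here is a minimal sketch in plain Python of how the historical range could be split into monthly windows. The function name month_windows and the example dates are hypothetical, and the sketch only plans the windows. It does not read any data and does not answer the checkpointing question.

```python
from datetime import date


def month_windows(start: date, end: date) -> list[tuple[date, date]]:
    """Split [start, end) into consecutive half-open windows that break at month boundaries."""
    windows = []
    cursor = start
    while cursor < end:
        # First day of the month after `cursor`.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        # Never run past the requested end of the history.
        upper = min(next_month, end)
        windows.append((cursor, upper))
        cursor = upper
    return windows


# Hypothetical history range (assumption, not taken from our data).
for lower, upper in month_windows(date(2023, 11, 15), date(2024, 2, 1)):
    print(lower, upper)
```

Each (lower, upper) pair could then drive one load during a low-usage period, for example as a filter of the form timestamp >= lower and timestamp < upper. The windows are half-open so that a record falling exactly on a month boundary belongs to only one window.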