I recommend using Spark Structured Streaming or Auto Loader with micro-batch processing. This approach processes data in discrete chunks rather than handling individual rows.
By setting an appropriate trigger (e.g., processingTime = '10 seconds', or availableNow for backfill scenarios), each micro-batch contains a manageable volume of data (e.g., 100,000 records), making it feasible and efficient to apply transformations and data quality validations such as null value checks and custom data profiling rules, as sketched below.
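Here is a minimal sketch of an Auto Loader stream with a micro-batch trigger. The input format, paths, schema location, and target table name are placeholders you would replace with your own:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader (cloudFiles) source; format and paths are illustrative placeholders
raw_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events_schema")
    .load("/mnt/landing/events")
)

# Micro-batch trigger: process roughly every 10 seconds,
# or swap in .trigger(availableNow=True) for a backfill run
query = (
    raw_stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .trigger(processingTime="10 seconds")
    .toTable("bronze.events")  # placeholder target table
)
```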
Within each micro-batch, data validation rules can be implemented with built-in Spark functions or even custom Spark logic, for example inside a foreachBatch handler (see the sketch below).
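As a sketch of that idea, the handler below applies assumed rules (non-null event_id, positive amount) to each micro-batch and routes failing rows to a quarantine table; the column names, rules, and table names are illustrative, and raw_stream refers to the Auto Loader stream from the earlier sketch:

```python
from pyspark.sql import functions as F

def validate_and_write(batch_df, batch_id):
    # Example data quality rules (assumed columns): required id present, amount positive
    is_valid = (
        F.col("event_id").isNotNull()
        & F.col("amount").isNotNull()
        & (F.col("amount") > 0)
    )

    valid_df = batch_df.filter(is_valid)
    # coalesce treats NULL rule results as failures so no row is silently dropped
    invalid_df = batch_df.filter(~F.coalesce(is_valid, F.lit(False)))

    # Route good records to the curated table, bad records to a quarantine table (placeholders)
    valid_df.write.mode("append").saveAsTable("silver.events")
    invalid_df.write.mode("append").saveAsTable("silver.events_quarantine")

validated_query = (
    raw_stream.writeStream
    .foreachBatch(validate_and_write)
    .option("checkpointLocation", "/mnt/checkpoints/events_validated")  # placeholder
    .trigger(processingTime="10 seconds")
    .start()
)
```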
Chanukya