Hi @sai_sathya, validating record counts between SQL Server database tables and Azure Data Lake Storage Gen2 can be automated with PySpark.
- A PySpark script can connect to the SQL Server database, retrieve the record count for each table, write the data to Azure Data Lake Storage Gen2 as Parquet files, and validate that the counts match.
- The script can be scheduled to run automatically at set intervals using Databricks Jobs, which can run notebooks or JARs on a schedule to automate the validation process.
- The provided PySpark code demonstrates how to structure this process: connect to SQL Server, retrieve record counts, write the data to Azure Data Lake Storage Gen2, read the Parquet files back, and validate the record counts.
- The code can be customized by replacing the placeholders <hostname>, <port>, <database>, <table>, <username>, <password>, and <path_to_parquet_file> with actual values.