How to find that given Parquet file got imported into Bronze Layer ?

Devsql — Thu, 02 May 2024 12:50:38 GMT

Hi Team,

Recently we created a new Databricks project (based on the Medallion architecture) with Bronze, Silver, and Gold layer tables. We built a Delta Live Tables (DLT) pipeline for the Bronze layer. The source files are Parquet files in an ADLS location (an external location). The DLT pipeline reads the Parquet files from this external location and loads the data into the _RAW and _APPEND_RAW streaming tables.

What we found is that Parquet files keep arriving one after another at the external location, but the Bronze job (a DLT pipeline running in continuous mode) is not importing data from these files into the _raw tables.

To check this, I ran a row count on the _RAW table, as shown below, and found records only for the date on which we turned on the Bronze DLT pipeline (which has been running continuously since then).

SELECT bronze_landing_date, COUNT(*)
FROM abc_raw
GROUP BY bronze_landing_date

As the job has been running for the last 10 days, we should get 10 rows, one per date, but I am only getting 1 row (the date on which the job started).

So I would like to know how to find out whether a given Parquet file got imported into the Bronze layer.

Also, is there anything we are missing in the settings of the Bronze DLT pipeline?

Any pointers would be greatly appreciated.

Re: How to find that given Parquet file got imported into Bronze Layer ?

raphaelblg — Thu, 02 May 2024 23:05:35 GMT

Hello @Devsql ,

It appears that you are creating DLT bronze tables using a standard spark.read operation. This may explain why the DLT table doesn't include "new files" during a REFRESH operation.

For incremental ingestion of bronze layer data into your DLT pipeline and tables, we recommend using Auto Loader. You can find more information in the following documents:

- DLT Update Modes (Full Refresh/Refresh): https://docs.databricks.com/en/delta-live-tables/updates.html
- Auto Loader: https://docs.databricks.com/en/ingestion/auto-loader/index.html#what-is-auto-loader
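
As a rough illustration of the Auto Loader approach, a Bronze streaming table in DLT SQL might look like the sketch below. The storage path is a placeholder, the table name abc_raw is borrowed from the query earlier in the thread, and source_file is a hypothetical column added here: it copies the _metadata.file_path value that file-based sources expose, so each row records the Parquet file it came from. Depending on the runtime, STREAM read_files(...) may be the preferred spelling instead of cloud_files(...).

```sql
-- Sketch only: the path and the source_file column are assumptions,
-- not the actual pipeline definition from this thread.
CREATE OR REFRESH STREAMING TABLE abc_raw
AS SELECT
  *,
  _metadata.file_path AS source_file  -- hypothetical column: full path of the source Parquet file
FROM cloud_files(
  "abfss://<container>@<storage-account>.dfs.core.windows.net/<landing-folder>/",
  "parquet"
);
```

With a column like this in place, filtering abc_raw on source_file for a specific file name would show directly whether that file's rows landed in the table.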

Re: How to find that given Parquet file got imported into Bronze Layer ?

Devsql — Tue, 21 May 2024 06:33:14 GMT

Yes @raphaelblg, we are already using the Auto Loader option, and I understand that Auto Loader will continuously import files from ADLS Gen2 into the Bronze layer database.

But I am still not clear on how to know whether a given Parquet file got imported into the Bronze layer.

Do we just have to assume, based on how Auto Loader works, that the file was already imported, or is there a mechanism to check this?

Any article in this regard would be helpful.

Thanks

Devsql

Re: How to find that given Parquet file got imported into Bronze Layer ?

raphaelblg — Tue, 21 May 2024 14:27:30 GMT

Hello @Devsql

Auto Loader first lists files using one of its file detection modes (directory listing or file notification). The files discovered in each batch are recorded in the stream's checkpoint. If you wish to examine that state, you can use the cloud_files_state SQL function, which lists all files discovered by Auto Loader.
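
A minimal sketch of such a lookup is shown here, assuming a placeholder checkpoint path, a placeholder file name, and the abc_raw table name from earlier in the thread. The selected columns (path, size, discovery_time, commit_time) are the ones I would expect on recent runtimes; if commit_time is populated for a file, that file has been processed, and a NULL there suggests it was discovered but not yet committed.

```sql
-- Option 1: point the function at the Auto Loader checkpoint directory (placeholder path).
SELECT path, size, discovery_time, commit_time
FROM cloud_files_state('<path-to-checkpoint>')
WHERE path LIKE '%<file-name>.parquet';

-- Option 2: for a DLT streaming table, reference the table itself (assumed table name).
SELECT path, size, discovery_time, commit_time
FROM cloud_files_state(TABLE(abc_raw))
WHERE path LIKE '%<file-name>.parquet';
```

If the query returns no row at all, Auto Loader has not discovered the file, which usually points to the source path or file detection settings rather than to the load itself.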

Auto Loader uses checkpointing to maintain state, which provides exactly-once processing guarantees for your Spark Structured Streaming query.
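
Because each file is committed to the checkpoint once, the same state can be summarized per day and compared with the bronze_landing_date row counts from the original question. The sketch below assumes the abc_raw table name and that commit_time is available on your runtime.

```sql
-- Files committed per day; ideally one row for each day the pipeline has been running.
SELECT
  to_date(commit_time) AS commit_date,
  count(*)             AS files_committed
FROM cloud_files_state(TABLE(abc_raw))
WHERE commit_time IS NOT NULL
GROUP BY to_date(commit_time)
ORDER BY commit_date;

-- Files that were discovered but have not been committed yet.
SELECT path, discovery_time
FROM cloud_files_state(TABLE(abc_raw))
WHERE commit_time IS NULL;
```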

I hope this answer is helpful to you. I've tried to give an overview of the relevant Auto Loader features.