Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to find whether a given Parquet file got imported into the Bronze layer?

Devsql
New Contributor III

Hi Team,

Recently we created a new Databricks project/solution (based on the Medallion architecture) with Bronze, Silver, and Gold layer tables, and we built a Delta Live Tables (DLT) pipeline for the Bronze layer implementation. The source files are Parquet files located on ADLS (an external location). The DLT pipeline reads the Parquet files from this external location and imports the data into _RAW and _APPEND_RAW (streaming tables).

What we found is that Parquet files are being created serially at the external location, but the Bronze job (a DLT-based pipeline), running in continuous mode, is not importing data from the Parquet files into the _RAW tables.

As an alternative check, I ran a row count on the _RAW table, as shown below, and found that records are only present for the date on which we turned on the Bronze DLT pipeline (which has been running continuously):

SELECT bronze_landing_date, COUNT(*)
FROM abc_raw
GROUP BY bronze_landing_date

Since the job has been running for the last 10 days, we should get 10 rows for 10 dates, but I am only getting 1 row (the date on which the job was started).

So I would like to know: how can we find out whether a given Parquet file got imported into the Bronze layer?

Also, is there anything we are missing in the settings for the Bronze DLT pipeline?

Any pointers would be greatly appreciated.

3 REPLIES

raphaelblg
Databricks Employee

Hello @Devsql,

It appears that you are creating the DLT bronze tables using a standard spark.read operation. This may explain why the DLT tables don't pick up new files during a REFRESH operation.

For incremental ingestion of bronze layer data into your DLT pipeline and tables, we recommend using Auto Loader. You can find more information in the documents below; a minimal sketch follows the links:

- DLT Update Modes (Full Refresh/Refresh): https://docs.databricks.com/en/delta-live-tables/updates.html
- Autoloader: https://docs.databricks.com/en/ingestion/auto-loader/index.html#what-is-auto-loader
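
As a minimal sketch (assuming a SQL-defined DLT pipeline; the table name and landing-date column are taken from your description, while the ADLS path is a placeholder, and the _metadata file-metadata column is assumed to be available in your runtime), the bronze streaming table could be declared over Auto Loader with cloud_files so that new Parquet files are discovered and ingested incrementally:

CREATE OR REFRESH STREAMING LIVE TABLE abc_raw
COMMENT "Bronze raw table, ingested incrementally with Auto Loader"
AS SELECT
  *,
  current_date() AS bronze_landing_date, -- landing date stamped at ingestion time
  _metadata.file_path AS source_file     -- records which Parquet file each row came from
FROM cloud_files(
  'abfss://<container>@<storage-account>.dfs.core.windows.net/<landing-path>/',
  'parquet'
);

With a source_file column like this, checking whether a given Parquet file was imported becomes a simple filter on the bronze table itself.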

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

Devsql
New Contributor III

Yes @raphaelblg, we are already using the Auto Loader option, and I understand that Auto Loader will continuously import files from ADLS Gen2 into the Bronze layer DB.

But I am still not clear on how to know whether a given Parquet file got imported into the Bronze layer.

Based on the logic of Auto Loader, do we just need to assume that the file got imported, or is there a mechanism to verify this?

Any article in this regard would be helpful.

Thanks

Devsql

raphaelblg
Databricks Employee

Hello @Devsql,

Auto Loader initially lists files using one of its file detection modes. For each batch of files discovered, a checkpoint entry is recorded. If you wish to examine the state of your checkpoint, you can use the cloud_files_state SQL function, which displays all files discovered by Auto Loader.

Auto Loader uses checkpointing to maintain state, providing exactly-once processing guarantees throughout your Spark Structured Streaming query.
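
For example, a minimal check could look like the following (the checkpoint path is a placeholder; for a DLT pipeline the Auto Loader checkpoints live under the pipeline's storage location, and the exact layout depends on your setup):

-- Hypothetical checkpoint path; substitute the checkpoint location of your stream.
SELECT path, commit_time
FROM cloud_files_state('<pipeline-storage-location>/checkpoints/abc_raw/0')
WHERE path LIKE '%<your_file>.parquet';

A non-null commit_time for the file indicates that Auto Loader has finished processing it; if the file does not appear at all, it has not yet been discovered.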

I hope this answer is helpful to you. I've attempted to provide a comprehensive overview of the relevant Auto Loader features.

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks
