08-02-2022 01:22 PM
I'd like to know what the design pattern is for ingesting data via an HTTP API request. The pattern needs to use the multi-hop architecture. Do we need to ingest the JSON output to cloud storage first (not the bronze layer) and then use Auto Loader to process the data further? Or do we ingest the data as JSON into the bronze layer and then process it further? Thanks
08-03-2022 10:57 AM
The API -> Cloud Storage -> Delta pattern is the more suitable approach.
Auto Loader helps you not lose any data (it keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees), enables schema inference and evolution, supports file metadata, and lets you easily switch to batch processing using .trigger(once=True) or .trigger(availableNow=True).
In addition, the rescued data column ensures that you never lose or miss data during ETL. The rescued data column contains any data that wasn't parsed, either because it was missing from the given schema, because there was a type mismatch, or because the casing of the column in the record or file didn't match the schema. So, if data is added or changed in the source API, you will be able to identify the modification and decide what to do: either adapt the flow to integrate the new columns or just ignore them.
Finally, you will always keep your source files in JSON format. That way you can re-process them as needed, or export and share them in the future.
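For illustration, here is a minimal sketch of the Cloud Storage -> bronze Delta hop with Auto Loader. The paths, storage account placeholder, and table name are assumptions for the example, not values from this thread.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder locations: the landing folder the API job writes JSON files to,
# and a checkpoint/schema location for the stream.
raw_path = "abfss://landing@<storage-account>.dfs.core.windows.net/api-source/"
checkpoint_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/_checkpoints/api_source/"

bronze_stream = (
    spark.readStream
    .format("cloudFiles")                                   # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)   # enables schema inference and evolution
    .option("cloudFiles.schemaEvolutionMode", "rescue")     # unexpected fields land in _rescued_data instead of failing the stream
    .load(raw_path)
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                             # run as an incremental batch job
    .toTable("bronze.api_source")                           # placeholder bronze table name
)
```

With .trigger(availableNow=True) the same stream definition can be scheduled as a batch-style job that only picks up newly discovered files.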
08-03-2022 04:04 PM
Thank you very much. From your suggestion, I am now clearer on what we need to do. One more question: should we use ADF to ingest the data via an HTTP linked service and land it in the data lake, which is an easy and simple implementation; or should we use a notebook to call the API and then use a dataframe to save it as JSON? It seems to me that going from the API response to saving JSON in the data lake is not that straightforward. Is there any good sample code for a best-practice implementation? Thanks ahead.
08-04-2022 07:47 AM
Hi,
The answer depends on your strategic architecture and team knowledge. I hope the questions below will help you choose the right solution.
Architecture:
Team:
There is no particular code snippet: you can save the responses in the format of your choice (any file format supported by Auto Loader).
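As a hedged sketch of the notebook route, the example below calls an API with requests and lands the raw JSON response as a file in the data lake. The endpoint, storage path, and file-naming scheme are illustrative assumptions, not from this thread.

```python
import json
from datetime import datetime, timezone

import requests

# Call the source API (hypothetical endpoint).
response = requests.get("https://api.example.com/v1/orders", timeout=30)
response.raise_for_status()

# Land the raw payload as a timestamped JSON file so it can be re-processed,
# exported, or shared later (placeholder landing folder).
landing_dir = "abfss://landing@<storage-account>.dfs.core.windows.net/api-source"
file_name = f"orders_{datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')}.json"

# dbutils is available in Databricks notebooks; the third argument allows overwriting.
dbutils.fs.put(f"{landing_dir}/{file_name}", json.dumps(response.json()), True)
```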
Be wise: if your decision is to use a Databricks notebook and you don't see the benefit of using Auto Loader (for example, to decouple the raw data gathering flow from the data processing flow so you can run and scale both independently), you can write directly to a Delta table and not create files on ADLS.
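For completeness, a minimal sketch of that direct-to-Delta variant, skipping the file landing step entirely. Again, the endpoint and table name are placeholder assumptions.

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fetch the payload (hypothetical endpoint) and normalize it to a list of records.
payload = requests.get("https://api.example.com/v1/orders", timeout=30).json()
records = payload if isinstance(payload, list) else [payload]

# Append the records straight to a bronze Delta table (placeholder name),
# letting Spark infer the schema from the JSON records.
df = spark.createDataFrame(records)
df.write.format("delta").mode("append").saveAsTable("bronze.api_source")
```

Note that with this approach you lose the ability to re-process the original API responses, which is the trade-off the file-landing pattern avoids.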