
Multi-hop architecture for ingesting data via an HTTP API

ftc
New Contributor II

I'd like to know the design pattern for ingesting data via HTTP API requests. The pattern needs to use the multi-hop architecture. Do we need to ingest the JSON output to cloud storage first (not the bronze layer) and then use Auto Loader to process the data further? Or do we ingest the data as JSON into the bronze layer and then process it further? Thanks

1 ACCEPTED SOLUTION


artsheiko
Valued Contributor III

The API -> Cloud Storage -> Delta approach is the more suitable one.

Auto Loader helps you avoid losing any data (it keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees), enables schema inference and evolution, supports file metadata, and lets you easily switch to batch processing using .trigger(once=True) or .trigger(availableNow=True).

In addition, the rescued data column ensures that you never lose or miss data during ETL. The rescued data column contains any data that wasn't parsed, either because it was missing from the given schema, because there was a type mismatch, or because the casing of the column in the record or file didn't match the schema. So, if some data is added or changed in the source API, you will be able to identify the modification and decide what to do: either adapt the flow to integrate the new columns or just ignore it.

Finally, you will always keep your source files in JSON format. That way you can re-process them as needed, or export and share them in the future.
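
For illustration, a minimal sketch of the Cloud Storage -> Delta hop with Auto Loader could look like the following; the storage path, checkpoint location, and table name are placeholders to adapt to your environment:

```python
# Minimal sketch (PySpark on Databricks): Auto Loader incrementally picks up the raw
# JSON files written by the API ingestion job and appends them to a bronze Delta table.
raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net/my_api/"        # hypothetical landing zone
checkpoint = "abfss://raw@<storage-account>.dfs.core.windows.net/_chk/my_api"  # hypothetical checkpoint/schema location
bronze_table = "bronze.my_api_events"                                          # hypothetical target table

(spark.readStream
    .format("cloudFiles")                                # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint)     # enables schema inference and tracking
    .option("cloudFiles.schemaEvolutionMode", "rescue")  # unexpected fields land in _rescued_data
    .load(raw_path)
    .writeStream
    .option("checkpointLocation", checkpoint)
    .trigger(availableNow=True)                          # process all available files, then stop
    .toTable(bronze_table))
```

Scheduled with .trigger(availableNow=True), this behaves like an incremental batch job; remove the trigger to run it as a continuous stream.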


REPLIES


ftc
New Contributor II

Thank you very much. From your suggestion, I am now clearer on what we need to do. One more question: should we use ADF to ingest the data via an HTTP linked service and land it in the data lake, which is an easy and simple implementation, or should we use a notebook to call the API and then use a DataFrame to save the response as JSON? It seems to me that going from the API response to JSON saved in the data lake is not that straightforward. Is there any good sample code for a best-practice implementation? Thanks ahead.

artsheiko
Valued Contributor III

Hi,

The answer depends on your strategic architecture and your team's knowledge. I hope the questions below will help you choose the right solution.

Architecture:

  • ADF is available only in Azure: what will you do if you decide to migrate to another cloud?
  • Does the API support batch mode? Be aware that the ADF pricing model is based on activity execution time and the number of activity runs. So, if one day you plan to request 1M records from the API one by one, you will need to execute 1M activities (for more, check the Azure pricing calculator).
  • Maybe an Azure Function, Azure Logic Apps, or Azure Automation would be the most suitable solution in your case?
  • Is it necessary that all data pass only through ADF, or do you plan to deploy it only for this API (i.e., is ADF the single entry point for your data today and in the future)?
  • How will you manage the API key / token? Will it be Azure Key Vault (in that case, your ADF pipeline will be more complex, because before calling the API you will need another Web activity to get the key from Key Vault; please read the caution blocks carefully), or do you prefer to use Databricks secret scopes?

Team:

  • Do they prefer a drag-and-drop approach with configuration fields to fill in, or are they more technical people who love Git and coding patterns they can easily re-use?
  • Let's say that one day new data must be gathered from another API: what will be the simplest and most robust way to implement the solution?

There is no particular code snippet: you can save the responses in the format of your choice (any file format supported by Auto Loader).
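
That said, a minimal notebook sketch of the API -> Cloud Storage hop could look like the following; the endpoint, secret scope, and landing path are hypothetical, and the raw response is persisted as-is so that Auto Loader can pick it up in the next hop:

```python
import json
from datetime import datetime, timezone

import requests  # assuming the requests library is available on the cluster

API_URL = "https://api.example.com/v1/records"  # hypothetical endpoint
LANDING_DIR = "abfss://raw@<storage-account>.dfs.core.windows.net/my_api"  # hypothetical landing zone

# Token retrieved from a Databricks secret scope (see the Key Vault / secret scope question above).
token = dbutils.secrets.get(scope="my-scope", key="my-api-token")

response = requests.get(API_URL, headers={"Authorization": f"Bearer {token}"}, timeout=60)
response.raise_for_status()

# Persist the raw payload as-is, foldered by ingestion timestamp, for Auto Loader to discover.
ts = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
dbutils.fs.put(f"{LANDING_DIR}/ingest_ts={ts}/response.json",
               json.dumps(response.json()),
               overwrite=True)
```

Pagination, retries, and error handling are omitted for brevity.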

Be wise: if your decision is to use a Databricks notebook and you don't see the benefit of using Auto Loader (for example, to decouple the raw data gathering flow from the data processing flow, allowing you to run and scale both independently), you can write directly to a Delta table and not create files on ADLS.
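
If you do go that route, a sketch of the direct-to-Delta variant (skipping the landing files) might look like this; the table name and payload shape are again placeholders:

```python
import json
from pyspark.sql import functions as F

# Assuming `records` holds the parsed API payload, e.g. a list of JSON objects
# returned by a call like the one in the previous sketch.
records = response.json()

# Keep the raw payload as a string column plus an ingestion timestamp, so the bronze
# table stays close to the source and can be re-parsed or replayed later.
df = (spark.createDataFrame([(json.dumps(r),) for r in records], schema="raw_payload STRING")
      .withColumn("ingest_ts", F.current_timestamp()))

df.write.format("delta").mode("append").saveAsTable("bronze.my_api_events")  # hypothetical table
```

The trade-off is that you lose the replayable raw files on storage, so only do this when the Delta table itself is an acceptable system of record.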
