
Multi-Hop Architecture for ingesting data via HTTP API

ftc
New Contributor II

I'd like to know the design pattern for ingesting data via HTTP API requests. The pattern needs to use the multi-hop architecture. Do we need to ingest the JSON output into cloud storage first (not the bronze layer) and then use Auto Loader to process the data further? Or do we ingest the data in JSON format directly into the bronze layer and then process it further? Thanks


3 REPLIES

artsheiko
Databricks Employee
Accepted Solution

The API -> Cloud Storage -> Delta approach is more suitable.

Auto Loader helps you not lose any data (it keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees), enables schema inference and evolution, supports file metadata, and lets you easily switch to batch processing using .trigger(once=True) or .trigger(availableNow=True).

In addition, the rescued data column ensures that you never lose or miss out on data during ETL. The rescued data column contains any data that wasn't parsed, either because it was missing from the given schema, because there was a type mismatch, or because the casing of the column in the record or file didn't match the schema. So, if data is added or changed in the source API, you will be able to identify the modification and decide what to do: either adapt the flow to integrate the new columns or simply ignore them.

Finally, you will always keep your source files in JSON format, so you can re-process, export, or share them as needed in the future.
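
As a minimal sketch of that pattern (the storage paths and table name below are hypothetical placeholders, not from this thread), a bronze ingestion with Auto Loader could look like this:

    # Minimal Auto Loader sketch: JSON files landed by the API extraction job -> bronze Delta table.
    # `spark` is provided by the Databricks notebook runtime; paths and table name are placeholders.
    raw_path = "abfss://landing@<storage_account>.dfs.core.windows.net/api_source/"
    checkpoint_path = "abfss://bronze@<storage_account>.dfs.core.windows.net/_checkpoints/api_source/"

    bronze_stream = (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", checkpoint_path)  # enables schema inference and evolution
            .load(raw_path)
    )

    (
        bronze_stream.writeStream
            .option("checkpointLocation", checkpoint_path)
            .trigger(availableNow=True)  # process all new files, then stop (batch-style run)
            .toTable("bronze.api_source")
    )

Unparsed or unexpected fields end up in the _rescued_data column that Auto Loader adds by default when schema inference is used.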

ftc
New Contributor II

Thank you very much. Your suggestion makes it much clearer what we need to do. One more question: should we use ADF to ingest the data via an HTTP linked service and land it in the data lake (an easy and simple implementation), or should we use a notebook to call the API and then use a DataFrame to save the result as JSON? Going from the API response to JSON saved in the data lake doesn't seem that straightforward to me. Is there any good sample code for a best-practice implementation? Thanks ahead.

artsheiko
Databricks Employee

Hi,

The answer depends on your overall architecture strategy and your team's skills. I hope the questions below will help you choose the right solution.

Architecture:

  • ADF is available only in Azure - what will you do if you decide to migrate to another cloud?
  • Does the API support batch mode? Be aware that ADF's pricing model is based on activity execution time and the number of activity runs. So, if one day you plan to request 1M records from the API one by one, you will need to execute 1M activities (for more, check the Azure pricing calculator).
  • Maybe Azure Functions / Azure Logic Apps / Azure Automation would be the most suitable solution in your case?
  • Is it necessary that all data pass only through ADF, or do you plan to deploy it only for this API (is ADF the single entry point for your data, today and in the future)?
  • How will you manage the API key / token? Will it be Azure Key Vault (in that case your ADF pipeline will be more complex, because before calling the API you will need another web activity to fetch the key from Key Vault; please read the caution blocks carefully), or do you prefer Databricks secret scopes (see the short sketch after this list)?
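
For the secret-scope option, retrieving the token from a notebook is a one-liner; the scope and key names here are made-up placeholders:

    # Hypothetical example: read an API token from a Databricks secret scope.
    # `dbutils` is provided by the Databricks notebook runtime.
    api_token = dbutils.secrets.get(scope="ingestion-secrets", key="source-api-token")

An Azure Key Vault-backed secret scope is read the same way from the notebook side; only the scope configuration differs.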

Team:

  • Do they prefer a drag-and-drop approach with configuration fields to fill in, or are they more technical people who love Git and coding patterns they can easily re-use?
  • Let's say one day new data must be gathered from another API - what would be the simplest and most robust way to implement the solution?

No particular code snippet - you can save the responses in the format of your choice (any file format supported by Auto Loader).
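
For illustration only, a notebook cell that lands raw API responses as JSON files for Auto Loader to pick up might look like the sketch below; the endpoint, secret scope, and landing path are hypothetical and would need to match your environment.

    import json
    import uuid
    from datetime import datetime, timezone

    import requests

    # Hypothetical endpoint, secret scope, and landing path - adapt to your API and storage layout.
    api_url = "https://api.example.com/v1/records"
    landing_path = "abfss://landing@<storage_account>.dfs.core.windows.net/api_source/"
    api_token = dbutils.secrets.get(scope="ingestion-secrets", key="source-api-token")

    response = requests.get(api_url, headers={"Authorization": f"Bearer {api_token}"}, timeout=60)
    response.raise_for_status()

    # One file per extraction run; Auto Loader later discovers new files incrementally.
    file_name = f"{datetime.now(timezone.utc):%Y/%m/%d}/{uuid.uuid4()}.json"
    dbutils.fs.put(landing_path + file_name, json.dumps(response.json()), True)  # True = overwrite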

Be wise - if you decide to use a Databricks notebook and you don't see a benefit in using Auto Loader (for example, to decouple the raw data gathering flow from the data processing flow so you can run and scale both independently), you can write directly to a Delta table and skip creating files on ADLS.
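
A sketch of that direct-to-Delta variant, under the assumption that the (hypothetical) endpoint returns a list of JSON objects sharing the same fields:

    import requests
    from pyspark.sql import Row

    # Hypothetical direct-to-Delta variant: no file landing zone, append straight to a bronze table.
    # Endpoint and table name are placeholders; `spark` comes from the Databricks notebook runtime.
    response = requests.get("https://api.example.com/v1/records", timeout=60)
    response.raise_for_status()

    records = response.json()  # assumed: a list of JSON objects with identical keys
    df = spark.createDataFrame([Row(**r) for r in records])
    df.write.mode("append").saveAsTable("bronze.api_source_direct")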
