
Ingesting data from APIs

Anonym40
New Contributor II

Hi,
I need to ingest some data that is available at an API endpoint.
I was thinking of this option:
1. Make the API call from a notebook and save the data to ADLS.
2. Use Auto Loader to load the data from the ADLS location.
But I have some doubts. Since I could write the API response directly to a table, is writing to ADLS an unnecessary step? Then I thought that if I ever drop the table, I could use the files in ADLS to reload it. But then again, couldn't I just restore an earlier table version instead of using something like COPY INTO with the files in ADLS?
Which approach should I take? How are other people doing it?
Thanks,

2 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @Anonym40 ,

There’s no silver bullet here; it’s largely a matter of preference. I would prefer the approach you described: a separate process responsible for extracting the data from the API and saving it to the data lake, and then Auto Loader to process the data into the bronze layer.
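As an illustration, here’s a minimal sketch of that extraction step in a notebook. The endpoint URL, ADLS path, and file naming below are placeholders, and I’m assuming authentication to both the API and storage is already configured:

```python
from datetime import datetime, timezone

import requests

# Hypothetical endpoint and landing path -- replace with your own.
API_URL = "https://api.example.com/v1/orders"
LANDING_PATH = "abfss://raw@mystorageaccount.dfs.core.windows.net/orders"

response = requests.get(API_URL, timeout=30)
response.raise_for_status()

# Land the raw, unparsed response so it can always be replayed later.
# One timestamped file per extraction run; dbutils is available inside
# Databricks notebooks.
ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
dbutils.fs.put(f"{LANDING_PATH}/extract_{ts}.json", response.text, overwrite=False)
```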

With this approach, if there is ever a need to reload the data, you’ll have it readily available in the lake.
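The Auto Loader side could then look like this sketch, reusing the same hypothetical landing path and a placeholder catalog.schema.table name:

```python
LANDING_PATH = "abfss://raw@mystorageaccount.dfs.core.windows.net/orders"
CHECKPOINT = "abfss://raw@mystorageaccount.dfs.core.windows.net/_checkpoints/orders_bronze"

(spark.readStream
    .format("cloudFiles")                             # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", CHECKPOINT)  # schema inference/evolution
    .load(LANDING_PATH)
    .writeStream
    .option("checkpointLocation", CHECKPOINT)
    .trigger(availableNow=True)                       # incremental, batch-style run
    .toTable("main.bronze.orders"))                   # placeholder table name
```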

Another argument is that some APIs don’t allow retrieving data older than a certain period (for example, anything older than three months may no longer be available).

If you were to write the data directly to a table and had a minor bug in the response-parsing code that went unnoticed for a long time, you would no longer be able to correct the data. But if you always land the data from the source in its unchanged format, you have an easy way to rebuild the entire table in case of any issue.
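That rebuild could be as simple as this sketch, assuming the corrected logic lives in a hypothetical parse_orders() function and reusing the placeholder paths and table name from above:

```python
LANDING_PATH = "abfss://raw@mystorageaccount.dfs.core.windows.net/orders"

raw_df = spark.read.json(LANDING_PATH)  # re-read every raw file ever landed
fixed_df = parse_orders(raw_df)         # re-apply the corrected parsing logic

# Overwrite the bronze table with the correctly parsed history.
fixed_df.write.mode("overwrite").saveAsTable("main.bronze.orders")
```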

Raman_Unifeye
Contributor III

@Anonym40 - it's generally a good idea to decouple the direct API calls from the rest of your data pipeline. By staging the data in ADLS, you insulate your downstream processes from upstream failures and gain more restartability and maintainability in your end-to-end flow. Also, if anyone else (a data science or other team) ever needs the staged data, it's still available to consume.

For the rest, I agree with @szymon_dybczak.


RG #Driving Business Outcomes with Data Intelligence