
(Episode 3: Hands-on API Project) - Learning Databricks one brick at a time, using the Free Edition

BS_THE_ANALYST
Esteemed Contributor II

Episode 3: APIs
Learning Databricks one brick at a time, using the Free Edition.

Project Intro
Welcome to everyone reading. My name’s Ben, a.k.a BS_THE_ANALYST, and I’m going to share my experiences as I explore the world of Databricks. My objective is to master Data Engineering on Databricks, weave in AI & ML, and perform analyses.

I hope this content serves two purposes. Firstly, as motivation for people who want to explore and hone their skills. Secondly, for seasoned Databricks users: I’m keen to learn best practices, so I encourage you to reach out or provide feedback on what you’d do differently. This is the beauty of learning.

Today’s Objectives
Today, we're jumping straight into a project where we'll build a movie recommendation system.

In Part 1, we'll hit an API to retrieve recommended movies and load the API's responses into a Delta table, so we never forget the recommended gems. Then, in Part 2, we'll receive a custom e-mail detailing our recommended movies OR we'll build an AI/BI dashboard! (maybe both mwahaha)

What's an API? (a noob's understanding)
Let's start with the technical name, and then I'll provide my non-technical understanding. So, API: Application Programming Interface. It's a set of rules and tools that allows applications to interact with one another.

When I was first wrapping my head around this, I thought of it like ordering a meal in a restaurant. If I place an order directly with the kitchen, they might just chuck the raw ingredients back at me because I didn't comply with their ordering system. Instead, I provide my order to a waiter, the waiter passes this to the kitchen, where the request is properly understood, and my food arrives at my table as I'd have expected.

So, if an application makes a request to another application, imagine the API as the waiter that sits between them, the middleman if you like. It translates our orders (the "requests") so the recipient system understands them, and hands the food (the "response") back to the requester in a nice, expected format.

If you're following me still, we'll take it up one notch. Imagine we're running short on time. We aren't going to be able to make our reservation at the restaurant as we're running late; unfortunately the kids didn't make it out of the door and threw up all over the family cat! Instead, we're going to have to order a takeaway from the restaurant. The restaurant is only happy to fulfil the order if we can prove who we are. This concept is called authentication.

There are various ways to authenticate when making requests to APIs: Basic Auth and OAuth2 are common. Which you choose depends on what the API offers. For today, we'll create an account for the API, which is typically the first step to getting an API key. This key is used with both Basic Auth and OAuth2; OAuth2 just requires some extra hoops to jump through. Naturally, the API key is associated with the account you sign up for. The provider will keep an audit of how much data you request and will often rate limit the amount you can request, just so you don't overwhelm the server! Some APIs cost money when you make requests. Make sure you keep your API key protected. You can use things like Databricks' Secret Manager to bring credentials into your scripts and keep them protected.
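To make that concrete, here's a minimal sketch (using Python's standard library, with a made-up endpoint and key rather than the real Movies API) of how a key typically rides along with a request in the Authorization header:

```python
import urllib.request

# Hypothetical values -- substitute your own endpoint and key.
API_KEY = "my-secret-key"
URL = "https://api.example.com/v1/movies"

# Many APIs accept the key as a Bearer token in the Authorization header
# (OAuth2 style); others expect a "Basic" prefix or a query parameter.
request = urllib.request.Request(URL, headers={
    "Authorization": f"Bearer {API_KEY}",
    "Accept": "application/json",
})

print(request.get_header("Authorization"))  # Bearer my-secret-key
```

Sending the request would then just be `urllib.request.urlopen(request)` (or the equivalent `requests.get` call), but the important part is the header: that's the "proving who we are" step.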

Project Steps

1. Setting up API Credentials
First things first: we're going to need to acquire an API key from the Movies API that I've selected. I picked this particular API because it supports free use. Unfortunately, IMDb, a popular rating site, uses a paid-for API.

Below, I signed up by clicking the 1st link above for getting started.
BS_THE_ANALYST_0-1761119041020.png

Then it asks me to set up an "application"; I chose personal use. This isn't uncommon when signing up for an API key, but not every website makes you set up an application, it's just another hoop to jump through. Applications quite often allow you to maintain many user accounts, which is useful for proper commercial use! I just put test values in the fields during the application sign-up; fill them in as you'd like.

Now, to retrieve my API-KEY:
1. This pops up after creating the application:

BS_THE_ANALYST_1-1761119068567.png

2. It redirects you here, or you can navigate there yourself:

BS_THE_ANALYST_2-1761119187093.png

 

So, why are there two API components in the picture above?! APIs tend to be different!
Above, in the API Read Access Token, you can see the first characters are "ey __blahblah__"; this typically means you have a token. Where do tokens come into this process? Well, typically, they come into the OAuth2 authorisation process. Sometimes, you'll use your API key to generate this token as part of your workflow:
1. User Makes Request to API Authorisation Endpoint with API Key
2. API returns a Token for the user if Authorisation is approved, typically JSON response.
3. User makes request to desired API with Token
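The three steps above can be sketched in Python. The HTTP calls are stubbed out here (the token and endpoint are invented), so the focus is purely on the shape of the exchange:

```python
import json

# Hypothetical OAuth2-style flow with the network calls stubbed out.

def request_token(api_key: str) -> str:
    # Steps 1+2: POST the API key to the auth endpoint; the API replies
    # with a JSON body containing a (usually short-lived) token.
    fake_json_response = json.dumps({"access_token": "ey.fake.token",
                                     "expires_in": 3600})
    return json.loads(fake_json_response)["access_token"]

def call_api(token: str) -> dict:
    # Step 3: use the token as a Bearer credential on the real request.
    headers = {"Authorization": f"Bearer {token}"}
    return headers  # a real call would send these headers with requests.get(...)

token = request_token("my-api-key")
print(call_api(token)["Authorization"])  # Bearer ey.fake.token
```

Against a real API, `request_token` would be an HTTP POST to the provider's auth endpoint, and you'd re-run it whenever the token expires.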

The token often has a short life and expires after a certain period. However, since we've actually just got a token assigned to our account, it'll likely have a long life. So, essentially, we'll be using OAuth2 today, but the initial steps are being handled for us; this is a first for me (part of the fun)! The steps above are a general rule of thumb. What I've learned with using APIs is that you need to poke around a little to discover exactly how to use them. Some are super simple! Some take extra steps. Don't worry: if you dig around on the developer website, it'll often guide you through setting up the authorisation if it isn't obvious. Once you've authorised against a couple of different APIs, you'll find they all roughly use the same setups.

2. Selecting Endpoint + Typical API Developer Hub Tips + POSTMAN!!
Typical API Developer Hub Tips
So, first off, you should sign into the Developer Hub using your account in the previous steps. You should see the  following:

BS_THE_ANALYST_1-1761122579858.png

1. API Reference contains all the endpoints + authentication endpoints, should we require them shortly.
2. Authentication tab, pretty cool, you can see it's got my API Key in.
3. API Key
4. API Read Access Token from a previous step, cool. Also, notice the arrow from point 4: it says Authorization: Bearer {api read access token}. When you see "Bearer" in the header, it typically means you're using OAuth2. If it was Basic, it'd say "Basic" as a prefix instead, and the API key (or a base64-encoded API key) would follow it.
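Here's a small sketch of the difference between the two header styles, with made-up credentials; note the exact username:password convention for Basic Auth varies by API (some use "apikey:&lt;key&gt;", some use your account name), so check the docs:

```python
import base64

API_KEY = "my-api-key"          # hypothetical
READ_ACCESS_TOKEN = "ey.fake"   # hypothetical

# OAuth2 style: the token goes straight after the "Bearer" prefix.
bearer_header = {"Authorization": f"Bearer {READ_ACCESS_TOKEN}"}

# Basic style: "username:password" is base64-encoded and follows "Basic".
credentials = base64.b64encode(f"apikey:{API_KEY}".encode()).decode()
basic_header = {"Authorization": f"Basic {credentials}"}

print(bearer_header["Authorization"])  # Bearer ey.fake
print(basic_header["Authorization"])   # Basic YXBpa2V5Om15LWFwaS1rZXk=
```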

Notice there's a TRY IT button? The API developer hubs are awesome. Before building anything, you can just play around with the endpoints. Then you can export the Python code straight from the window or construct it to suit your use case! I was gutted I didn't know about them sooner. They save so much time.

Selecting an Endpoint 
Here we are folks, finally, we're about to GET something after all this hard work of setting things up (phew). So, which endpoint should we "GET" information from? Below are the ones that took my fancy:

BS_THE_ANALYST_3-1761123310439.png

As you can see above, I've selected the Top Rated endpoint. Notice the "GET" next to it; this means it's a GET request, the most common type of request for retrieving information. I'd encourage you to check out the other types at some point, i.e. POST, DELETE, PUT.

On the right-hand side, you can see I've clicked "Try It" and it's provided me with a JSON response. Typically, you'll recognise it's JSON because, if you copied the entire content out of the response and looked at the start/end of it, it'd be surrounded by some big squiggly brackets {}. In JSON, info is organised in key-value pairs, e.g. facts about me: {"age": 31, "location": {"city": "Newcastle", "country": "England"}}. It can have nested info, and it's really simple to traverse.
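Here's that same example as runnable Python, showing how simple the traversal is once the JSON is parsed:

```python
import json

# A small JSON string, with nested info, as you'd get back from an API.
raw = '{"age": 31, "location": {"city": "Newcastle", "country": "England"}}'

facts = json.loads(raw)           # parse the JSON string into a Python dict
print(facts["age"])               # 31
print(facts["location"]["city"])  # Newcastle
```

Libraries like `requests` do this parsing for you via `response.json()`, handing you the dict directly.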

For the movie fanatics amongst you, you'll see the #1 rated movie is The Shawshank Redemption, absolute banger! 

But how many movies have I been given back in the response? This depends on the API! Two questions might arise now:
1. How do I actually view all of this JSON Response information in a coherent way?
2. How do I make this API request outside of the developer hub?

Now, I am going to give you a whistlestop tour of a VERY popular platform for making API requests: Postman (https://www.postman.com/).
Read through my steps below, then create an account and follow along. Of course, you can skip Postman and jump straight into the Databricks section, but I'd urge you to read the steps below and see how cool it is!


POSTMAN
1. Once logged in, jump into your workspace

BS_THE_ANALYST_4-1761123396472.png


2. Click the "+" in the picture below to setup the window for making a request:

BS_THE_ANALYST_5-1761123403931.png

 

3. Jump back to the developer page and copy the GET request as Shell.
This basically copies all the appropriate info across to make the request on a different system. If we were using Databricks, I'd pick Python, as it supports this. cURL/Shell is always easiest in Postman, in my experience. So copy that over.

BS_THE_ANALYST_6-1761123420977.png

 

4. Paste that copied value into Postman (where my arrow is, with number 1), and it will AUTOFILL all the information you require to make the request. Then simply press Send and, voila, you've made the API request from a different system. You'll see the response at the very bottom of your screen.

  BS_THE_ANALYST_7-1761123429024.png


5. Now comes the magic.
Below, you can see that if I select Preview, it collapses the JSON into an incredibly readable format! You can click on the various components to expand them!

BS_THE_ANALYST_8-1761123439602.png

Now, onto the important stuff. You'll notice it's returned some key information. When we made this request, it was auto-populated with page = 1; of course, we can change this by requesting page = 2, or even iterating over the pages! There are 521 pages in total, which would mean making 521 requests. If I ran a for loop, these would be sent FAST. This can overwhelm systems, especially if lots of users are doing it. To prevent this, each developer account often has a rate limit against it. Typically, if you pay money, you get a higher rate limit, etc. For today, I only care about the first 10 pages, so I'll iterate over those. If you know you need to rate limit, just put a Python check in to throttle your requests, e.g. when the iteration number mod 10 = 0, sleep for 3 minutes; in other words, after every batch of 10 requests, wait 3 minutes.
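A rough sketch of that throttling idea in Python. The request itself is stubbed so the loop runs offline, and the sleep is shortened (against a real API you'd wait minutes, not hundredths of a second):

```python
import time

def fetch_page(page: int) -> dict:
    # Stand-in for the real GET request, e.g.
    # requests.get(url, headers=headers, params={"page": page}).json()
    return {"page": page, "results": []}

responses = []
for page in range(1, 11):          # pages 1..10
    responses.append(fetch_page(page))
    if page % 10 == 0:             # after every batch of 10 requests...
        time.sleep(0.01)           # ...pause (use e.g. 180 against a real API)

print(len(responses))  # 10
```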

So, I'm happy here. I want to retrieve 10 pages' worth of movie responses. Each response will need to be stored; I'll create a Delta table in Databricks from it. This requesting mechanism is called paginated requests. Sometimes you just have one page but can set a big limit, e.g. limit = 10000. It just depends on the API. Some even let you retrieve everything in one request.

3. Automating Paginated API Requests in Databricks

In the previous section, we saw how to copy a request from our developer hub to Postman. Let's do a similar thing by copying the example code provided, this time as a Python script; previously we selected Shell.

Now, open yourself up a notebook in your Databricks environment, I'll be using the Free Edition. Copy that code over like so:

BS_THE_ANALYST_9-1761123530524.png

Copying into a notebook:

BS_THE_ANALYST_10-1761123566869.png

Then spruce this code up a tad! I've converted it into a while loop to iterate over 10 pages of requests. I concatenate all the responses together and store them in a list, put the list into a Pandas DataFrame, and, finally, write out to a Delta table in Databricks. I read it back in for good measure, haha!

BS_THE_ANALYST_11-1761123580989.png
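If the screenshot is hard to read, here's a simplified, runnable sketch of what the loop does. The HTTP call is stubbed out (so no API key is needed to run it), and the pandas/Delta lines, which only work inside Databricks, are left as comments; in the notebook, the stub would be the `requests.get(...)` snippet exported from the developer hub:

```python
def fetch_page(page: int) -> dict:
    # Stand-in for: requests.get(url, headers=headers, params={"page": page}).json()
    # Shape mimics a paginated movies response: 3 fake movies per page.
    return {"page": page,
            "results": [{"title": f"Movie {page}-{i}", "vote_average": 8.0}
                        for i in range(3)]}

all_movies = []
page = 1
while page <= 10:                            # iterate the first 10 pages
    response = fetch_page(page)
    all_movies.extend(response["results"])   # concatenate each page's movies
    page += 1

# In a Databricks notebook, you'd then write the list out as a Delta table:
# import pandas as pd
# df = spark.createDataFrame(pd.DataFrame(all_movies))
# df.write.mode("overwrite").saveAsTable("top_rated_movies")

print(len(all_movies))  # 30
```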

Voila!

4. Summary and next steps
So, there we have it folks. I was most certainly learning this as I went along. I hope you enjoyed it! 

There's plenty of room for improvement here. We need to store these API keys in Databricks better; you shouldn't have secrets in plain sight. What about incremental loading into our Delta table? We're effectively using batch ingestion and we'll re-ingest certain records, which is a waste of resources. What about the other movie endpoints? We only hit a single one!

Databricks currently has external credentials for HTTP connections (which we use for API requests) in Public Preview (https://docs.databricks.com/aws/en/query-federation/http#gsc.tab=0). Alternatively, you could level this up by using an external key vault, e.g. Azure Key Vault, or even Databricks' built-in Secret Manager! I'll be writing more about Databricks' Secret Manager in a subsequent blog, but here's the link if you want a head start: (https://docs.databricks.com/aws/en/security/secrets/#gsc.tab=0).

Next time, in Part 2, we'll round this off by providing ourselves with a movie recommendation system based on our Delta table, either using emails or AI/BI. There are so many avenues here.

All the best,
BS

6 REPLIES

TheOC
Contributor III

This is super cool, thanks for sharing @BS_THE_ANALYST 

I've already got ideas of new data sources I can hook up to using API connections from inside Databricks, both to enrich some of my existing projects (Postcode lookups, National Survey/census data, weather data), but also for some completely new data projects.

Thanks for showing exactly how you did it too. It makes it very easy to replicate.

Cheers,
TheOC

BS_THE_ANALYST
Esteemed Contributor II

Looking forward to seeing the Data Project @TheOC. Always looking for inspiration 🌚

All the best,
BS

Sujitha
Databricks Employee

WOW @BS_THE_ANALYST 
This is an excellent walkthrough, especially the way you simplified API concepts for beginners while showing how to implement them in Databricks. Great learning content for anyone exploring the Free Edition!

BS_THE_ANALYST
Esteemed Contributor II

Thanks @Sujitha! 😀.

Really enjoyed this one. I'm planning on having a blog at least every two weeks. Ideally, one per week is the dream 😏. Perhaps some video-style content will be good to complement the blogs.

All the best,
BS

JoyO
Visitor

This is great, thanks for sharing Ben, will share with my data community.

BS_THE_ANALYST
Esteemed Contributor II

Awesome @JoyO! 🙂.

If anyone has questions, feel free to reach out.

All the best,
BS