Episode 1: Getting Data In
Learning Databricks one brick at a time, using the Free Edition.
Project Intro
Welcome to everyone reading. My name's Ben, a.k.a. BS_THE_ANALYST, and I'm going to share my experiences as I learn the world of Databricks. My objective is to master Data Engineering on Databricks, weave in AI & ML, and perform analyses.
I hope this content serves two purposes. Firstly, as motivation for people who want to explore and hone their skills. Secondly, for seasoned Databricks users, I'm keen to learn best practices, so I encourage you to reach out or provide feedback on what you'd do differently. This is the beauty of learning.
Today's Objectives
So let's begin with Getting Data Into Databricks. This step is no small feat; it could be a series in itself, as there's much to consider depending on project requirements.
In today's world, Unity Catalog is the recommended way to manage and store data in Databricks. For older workspaces not yet on Unity Catalog, DBFS or mounted cloud storage are still valid approaches. For those wondering, DBFS was the default storage location for Unity Catalog's predecessor, the hive_metastore. If you'd like a solid breakdown of Unity Catalog, here's a great read:
Understanding Unity Catalog
Project
Picture this: I'm a user wanting to get some data into Databricks. On my computer, I've got some flat files (CSVs, maybe an XLSX). The first question I ask myself:
Do I want to manually upload the data, or do it programmatically?
Typically, manual is the route to prove a concept works. If I need automation later, I'll convert that process into something programmatic.
Manually Uploading CSV (Data) to a Unity Catalog Volume
Uploading manually is straightforward:
In the GIF below, you can see me selecting a catalog, selecting (or creating) a schema, selecting (or creating) a volume, and then choosing the file to upload.
Just like that, your file will be available to use in Databricks. You can view your data in a notebook or using the SQL editor like I am below:
Programmatically Uploading CSV (data) to a Unity Catalog Volume
DISCUSSION
We've got a few choices for uploading data programmatically from our local machine to Databricks: the CLI, the SDK, or the REST API. Here's a great explanation of what each of them offers: https://alexott.blogspot.com/2024/09/databricks-sdks-vs-cli-vs-rest-apis-vs.html
For today, we'll be using the Databricks CLI. There's fantastic documentation on how to do this: https://docs.databricks.com/aws/en/dev-tools/cli/tutorial.
Let me just point out the following useful parts:
In the navigation pane on the left-hand side, you'll find the tutorial that helps you install and set up the CLI to access your Databricks environment. Make sure to select the correct operating system, as seen in the picture below. The Command Reference section contains EVERYTHING. It's the gold mine for CLI commands.
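You can also explore that same reference straight from the terminal, since every command group has built-in help. A quick sketch (the exact output depends on your CLI version):
databricks --help
databricks schemas --help
databricks fs cp --help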
AUTHENTICATION
If you're curious, I authenticated with a Personal Access Token:
Generate a Personal Access Token:
https://docs.databricks.com/aws/en/dev-tools/auth/pat
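As a side note, if you'd rather not store the token interactively, the CLI also picks up credentials from environment variables. A minimal sketch for Command Prompt, with placeholder values rather than my real ones:
set DATABRICKS_HOST=https://<your-workspace-url>
set DATABRICKS_TOKEN=<your-personal-access-token>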
EXAMPLE OF ME USING THE CLI TO UPLOAD A CSV INTO A VOLUME
Installing
1. I'm on Windows and needed to install the CLI. I used Command Prompt and its built-in winget package manager to install it:
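For anyone following along, this is roughly what that looks like. The package ID below is what the Databricks docs list at the time of writing, so run the search first to confirm it on your machine:
winget search databricks
winget install Databricks.DatabricksCLI
databricks --version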
Authenticating
2. Enter your Databricks host and personal access token to authenticate.
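The interactive flow is along these lines. It's a sketch assuming token-based auth, and depending on your CLI version you may be prompted for both values or need to pass them as flags; the second command is just a quick check that authentication worked:
databricks configure
databricks current-user me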
Creating a Schema in a Catalog
3. Using the CLI command reference in the docs, I want to create a schema (database) and a volume within it. My catalog is called workspace. If you want to create a fresh catalog, you can do that from the CLI too. Recall, there's a three-level namespace in Unity Catalog: Catalog > Schema > Volume/Table/Model.
https://docs.databricks.com/aws/en/dev-tools/cli/reference/
Action: I'm creating a schema called "ZW_Bootcamp" in my catalog; note the catalog is called "workspace".
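The command I ran is along these lines, based on the schemas command group (the positional arguments are the schema name and the catalog it lives in):
databricks schemas create ZW_Bootcamp workspace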
Creating a Volume in a Schema
4. Create a volume called "datadumps" in the schema called "ZW_Bootcamp".
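Roughly what I ran, using the volumes command group; the arguments are catalog, schema, volume name, and the volume type (MANAGED here, since I'm not pointing at external storage):
databricks volumes create workspace ZW_Bootcamp datadumps MANAGED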
Uploading CSV from local machine into Volume
5. Upload the CSV into the volume you created. For this, we'll be using the fs (file system) command group, which lets us perform file-system operations. Documentation here: https://docs.databricks.com/aws/en/dev-tools/cli/reference/fs-commands
... note, I got caught out on this part: make sure you prefix your volume path with the dbfs:/ scheme, as seen in the pictures below. The top line is my command:
databricks fs cp "{PATH TO CSV HERE}" "dbfs:/Volumes/{catalog}/{schema}/{volume}"
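Filled in with the names from this walkthrough, it looks something like the below. The local path is just a made-up example, so swap in your own:
databricks fs cp "C:\Users\ben\Documents\sales_2024.csv" "dbfs:/Volumes/workspace/ZW_Bootcamp/datadumps/"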
Verify your upload
6. Check the Databricks UI to see whether the upload was successful, and voilà.
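If you'd rather stay in the terminal, listing the volume should show the file too, using the same path convention as the copy command:
databricks fs ls "dbfs:/Volumes/workspace/ZW_Bootcamp/datadumps/"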
Till next time
That's all for now, folks! There's still plenty to uncover. What about hitting data sources that aren't on our machines? How do we interact with APIs? Where do we store the credentials in our pipelines? How do we interact with databases? How do we automate everything? There's still so much in store for connecting to data. So much to learn, so little time.
All the best,
BS
Tuesday
Super cool blog @BS_THE_ANALYST
I have to confess, I haven't touched the CLI as much as I'd have liked to. This gives me confidence to go do that though.
Cool to see both ways: simple via the UI and programmatic through the CLI. Excited to see the next instalment!
Tuesday
Hey @TheOC.
Not sure if you caught this link in the blog: https://alexott.blogspot.com/2024/09/databricks-sdks-vs-cli-vs-rest-apis-vs.html
I'd really advise having a 5 min read through that, it's fab: "CLI is also a home for Databricks Asset Bundles". There's a great discussion around SDK vs CLI vs API.
Looking forward to seeing your next blog. I liked the episode about the Widgets; they're awesome.
All the best,
BS
Tuesday
Wonderfully illustrated, @BS_THE_ANALYST, can't wait for the next episode of your One Brick at a Time series!!!
Tuesday
Thanks @Advika! I'm looking forward to doing the next one already. Blogging is a new area for me, so feedback is more than welcome. I appreciate this post was pretty long.
All the best,
BS
yesterday
@BS_THE_ANALYST, it looks awesome. I am waiting for Agent Bricks to launch in our region so that I can also try new things; this will be a good starter.