Data volumes are exploding: the total global volume of data is projected to reach 394 zettabytes by 2028, up from just 2 zettabytes in 2010. For context, a zettabyte is equal to a trillion gigabytes. To put it into perspective, one zettabyte could hold roughly 250 billion DVDs.
Data engineering is the critical first step, no matter what fancy GenAI app you want to build in 2025. That’s why over 80% of enterprise leaders say success with AI depends on their ability to manage and prepare high-quality data pipelines.
Generative AI models — especially large language models (LLMs) — don’t just need more data. They need the right data: curated, cleaned, and contextually rich. Without it, even the most sophisticated AI produces unreliable or misleading outputs. It’s the classic rule: garbage in, garbage out.
In short, data engineering isn't a back-office function anymore; it's the foundation of trustworthy AI. Getting it right is the first step in building any intelligent system that works.
If GenAI is the engine, data engineering is the fuel system. It's not just about having data; it's about turning fragmented, messy data into something that AI can understand and learn from.
In practice, data engineering enables key foundations of any system:
For GenAI specifically, engineering pipelines are essential for:
Done right, data engineering makes GenAI more accurate, grounded, and ready for real-world use cases. It’s the invisible infrastructure that transforms AI from a cool demo into a business-ready product.
At the core of Databricks is the Lakehouse architecture, introduced by Databricks in 2021, which combines the best of data lakes (flexibility, scalability) and data warehouses (structure, performance). This means you can store vast amounts of raw data and also run high-performance analytics or machine learning on top of it, without duplicating systems.
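If you haven't seen the lakehouse pattern in action, here is a minimal sketch in PySpark (the default language of Databricks notebooks, where spark is predefined). The bucket path and table name are placeholders, not part of the original setup:

# Minimal lakehouse sketch: land raw files as a Delta table, then query
# the same copy with SQL -- no separate warehouse load. The S3 path and
# table name below are hypothetical placeholders.
raw_df = (
    spark.read
    .option("header", "true")
    .csv("s3://your-bucket/raw-data/")
)

# Persist as a governed Delta table (the lakehouse storage layer)
raw_df.write.format("delta").mode("overwrite").saveAsTable("raw_events")

# The very same table now serves BI-style SQL and ML feature prep
spark.sql("SELECT COUNT(*) AS row_count FROM raw_events").show()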
What makes Databricks especially powerful for GenAI and modern applications:
When it comes to building real-world GenAI applications, you need more than just a model — you need a stack: data pipelines, storage, compute, vector search, model serving, governance, and monitoring. This is where Databricks on AWS shines.
Databricks provides an end-to-end environment for building GenAI solutions, fully integrated with AWS services. It gives you all the tools to go from raw data to AI-powered apps, without needing to cobble together multiple platforms.
This tight integration means you don’t have to jump between systems or manage fragile handoffs between teams. You can build, test, deploy, and monitor GenAI apps, from prototype to production, entirely within Databricks on AWS.
If I had to sum up Databricks in one sentence: it is a unified platform for running data, analytics, and AI workloads.
Ok, now it is time to put everything we saw into action. This is a big section, so I’ve compartmentalized it into 5a, 5b, 5c, 5d, and 5e.
2. Enter your AWS credentials with which you want to set up Databricks
3. You will be redirected to AWS Marketplace, where you can start the free 14-day trial. You will be directed back and forth between your AWS account and the new Databricks account you are setting up; just follow the instructions. You will now have a Lakehouse set up as shown in the screenshot.
4. Use AWS QuickStart to create a new Databricks Workspace
5. Click Open on the newly created workspace to launch it.
So far, you have set up your AWS account and Databricks account.
Databricks and AWS CLI (optional)
These steps are completely optional; you can skip them if you prefer the UI route.
Setting up Databricks CLI:
databricks auth login
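If you want a quick sanity check that the CLI login worked, here is a minimal sketch using the Databricks Python SDK; it assumes the databricks-sdk package is installed and picks up the profile the CLI just wrote:

# Verify the credentials created by databricks auth login are usable from Python
# Assumes: pip install databricks-sdk
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()      # reads the profile from ~/.databrickscfg by default
me = w.current_user.me()   # simple authenticated API call
print("Authenticated to Databricks as:", me.user_name)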
Setting Up AWS CLI:
If Not Configured
Run aws configure and enter the following when prompted:
AWS Access Key ID
AWS Secret Access Key
Default region name (e.g. us-east-1)
Default output format (e.g. json)
🗒️ You’re ready to use the AWS CLI with services like S3, EC2, and Lambda, and also with the Databricks CLI when it needs AWS credentials (e.g. if you’re uploading data to S3 from Databricks).
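As a quick check that the AWS side is wired up too, here is a small sketch using boto3 (an assumption: boto3 installed via pip, reusing the credentials written by aws configure):

# Sanity check that AWS credentials are picked up correctly
# Assumes: pip install boto3
import boto3

identity = boto3.client("sts").get_caller_identity()
print("Authenticated to AWS account:", identity["Account"])

# List buckets to confirm S3 access (the demo bucket will appear once created)
for bucket in boto3.client("s3").list_buckets()["Buckets"]:
    print(" -", bucket["Name"])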
This section focuses on preparing the necessary data and the workspace structure for the project.
Steps for this tutorial
2. In my sample use case, I have 1 structured dataset (a CSV of support tickets) and 3 unstructured support documents.
Upload Sample data to S3 via console or AWS CLI:
Structured Tickets: support_tickets.csv
aws s3 cp support_tickets.csv s3://my-support-demo-bucket/source-data/
Unstructured Support Docs: a support_documents folder containing billing_faq.txt, technical_faq.txt, and product_guide.txt
aws s3 cp support_documents/ s3://my-support-demo-bucket/support-docs/ --recursive
3. Workspace setup in Databricks
Create a folder called genai-support-pipeline and store all your notebooks in it.
Workspace/
└── genai-support-pipeline/
├── 01_Load_and_Clean_Tickets.py
├── 02_Process_Documents.py
└── README.md
Recap:
For our use case, we need basic compute + Delta + ML runtime support — no need for GPU or big clusters.
For this GenAI tutorial, I created a lightweight, cost-effective single-node Databricks cluster using the 16.4 LTS ML runtime. I chose the r5dn.large instance type (16 GB RAM, 2 vCPUs), which is perfect for ETL, vector indexing, and small LLM tasks.
To control costs, I set the cluster to auto-terminate after 30 minutes of inactivity. This setup offers just the right balance of power and affordability for a hands-on project like this.
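If you prefer scripting the cluster over clicking through the UI, here is a rough sketch using the Databricks Python SDK. The spark_version string and tags are assumptions based on my setup; list the exact runtime identifiers available in your workspace before using it:

# Sketch: create a single-node 16.4 LTS ML cluster like the one described above.
# Assumes databricks-sdk is installed and authenticated; the spark_version
# identifier is an assumption (check w.clusters.spark_versions()).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

cluster = w.clusters.create_and_wait(
    cluster_name="genai-support-pipeline",
    spark_version="16.4.x-cpu-ml-scala2.12",   # 16.4 LTS ML runtime (assumed identifier)
    node_type_id="r5dn.large",                 # 16 GB RAM, 2 vCPUs
    num_workers=0,                             # single-node: driver only
    autotermination_minutes=30,                # auto-terminate to control costs
    spark_conf={
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    custom_tags={"ResourceClass": "SingleNode"},
)
print("Cluster ID:", cluster.cluster_id)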
A note on Cluster Costs:
4-step process (that took me over 2 hours to figure out):
Databricks Cheatsheet
Before you run the notebooks, here is a glossary table to get you oriented with some fields.
3. Start running through each notebook to understand various processes
4. Here is the notebook breakdown structure:
📓 Notebook 1: 01_Load_and_Clean_Tickets.py
📘 Tables Created:
💡 Goal:
Structure raw support ticket data for easy analysis, reporting, or ML.
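To make the flow concrete, here is a rough sketch of what this notebook does. Column handling and the output table name are simplified assumptions; the real notebook is in the GitHub repo linked at the end:

# 01_Load_and_Clean_Tickets.py (illustrative sketch only)
from pyspark.sql import functions as F

raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-support-demo-bucket/source-data/support_tickets.csv")
)

clean = (
    raw.dropDuplicates()
       .na.drop(how="all")                                # drop fully empty rows
       .withColumn("ingested_at", F.current_timestamp())  # simple audit column
)

# Persist as a Delta table for downstream analysis, reporting, or ML
clean.write.format("delta").mode("overwrite").saveAsTable("support_tickets_clean")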
📓 Notebook 2: 02_Process_Documents.py
📘 Tables Created:
💡 Goal:
Prepare an unstructured knowledge base (for semantic search and RAG).
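And a similarly simplified sketch for this notebook, with an assumed output table name:

# 02_Process_Documents.py (illustrative sketch only)
# Read the raw .txt support docs and land them in a Delta table that will
# later back the RAG knowledge base.
from pyspark.sql import functions as F

docs = (
    spark.read
    .format("text")
    .option("wholetext", "true")                 # one row per file
    .load("s3://my-support-demo-bucket/support-docs/*.txt")
    .withColumn("source", F.input_file_name())   # keep the originating file path
    .withColumnRenamed("value", "content")
)

docs.write.format("delta").mode("overwrite").saveAsTable("support_documents_raw")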
OPTIONAL NEXT STEPS: 📓 Notebook 3: Setting up a Simple RAG Workflow
As a preparation step for RAG, the documents comprising the knowledge base will be split into chunks and stored in Delta tables. Each chunk will be embedded using a Databricks-hosted embedding model to generate dense vector representations.
These embeddings, along with metadata such as id, source, and content, will be stored in a Delta table. A Vector Search Index is then created in Databricks to enable fast semantic retrieval of the most relevant chunks based on user queries.
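Here is a minimal sketch of that preparation step, building on the support_documents_raw table from the previous sketch. The chunk size and table names are assumptions, and Change Data Feed is enabled so a Vector Search index can sync from the table:

# Split documents into chunks and store them with id/source/content metadata
from pyspark.sql import functions as F, types as T

CHUNK_SIZE = 1000  # characters per chunk (assumed)

@F.udf(T.ArrayType(T.StringType()))
def split_into_chunks(text):
    text = text or ""
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

chunks = (
    spark.table("support_documents_raw")
         .withColumn("content", F.explode(split_into_chunks("content")))
         .withColumn("id", F.monotonically_increasing_id())
         .select("id", "source", "content")
)

chunks.write.format("delta").mode("overwrite").saveAsTable("support_docs_chunks")

# Required so a Delta Sync vector index can track changes to this table
spark.sql("ALTER TABLE support_docs_chunks SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")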
🔍 3 Steps to Set Up Vector Search in Databricks (see the sketch after this list):
1. Create a Vector Search Endpoint
2. Serve an Embedding Model
3. Create a Vector Search Index
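Below is a rough sketch of those three steps using the databricks-vectorsearch Python client. The endpoint name, fully qualified table/index names, and the embedding model endpoint are assumptions; adjust them to your workspace:

# Vector search setup sketch; all names are illustrative
# Assumes: pip install databricks-vectorsearch
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# 1. Create a Vector Search Endpoint
client.create_endpoint(name="support-vs-endpoint", endpoint_type="STANDARD")

# 2. Serve an Embedding Model
#    Here we reuse a Databricks-hosted embedding endpoint (e.g. databricks-gte-large-en)
#    instead of deploying our own model.

# 3. Create a Vector Search Index that syncs from the chunked Delta table
index = client.create_delta_sync_index(
    endpoint_name="support-vs-endpoint",
    index_name="main.default.support_docs_index",          # catalog.schema.name (assumed)
    source_table_name="main.default.support_docs_chunks",  # the chunk table from earlier
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="content",                     # Databricks computes the embeddings
    embedding_model_endpoint_name="databricks-gte-large-en",
)

# Once the initial sync completes, retrieve the most relevant chunks for a question
results = index.similarity_search(
    query_text="How do I update my billing information?",
    columns=["id", "source", "content"],
    num_results=3,
)
print(results)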
Note: You can find notebooks 1 and 2, along with the sample docs, in my GitHub repo (https://github.com/lulu3202/databricks_s3_starter/tree/main)
As the lines between data engineering and AI continue to blur, the ability to work across both domains is becoming a defining skill of the modern tech professional. Whether you’re preparing data pipelines, fine-tuning models, or deploying GenAI apps, understanding how data and AI connect is what sets impactful solutions apart from flashy prototypes. Platforms like Databricks — especially when paired with the scale and flexibility of AWS — make it easier than ever to experiment, build, and deploy end-to-end AI systems.
This is why the most in-demand tech professionals today — and tomorrow — are those who can bridge both domains:
Platforms like Databricks make this convergence accessible. Whether you’re starting with data engineering or exploring LLMs for the first time, Databricks gives you a space to experiment, learn, and build — all in one place, and all on top of trusted infrastructure like AWS.