Data volumes are exploding: the total global volume of data is projected to reach 394 zettabytes by 2028, up from just 2 zettabytes in 2010. For context, a zettabyte is equal to a trillion gigabytes. To put it into perspective, one zettabyte could hold roughly 250 billion DVDs.
Data engineering is the critical first step, no matter what fancy GenAI app you want to build in 2025. That’s why over 80% of enterprise leaders say success with AI depends on their ability to manage and prepare high-quality data pipelines.
Generative AI models — especially large language models (LLMs) — don’t just need more data. They need the right data: curated, cleaned, and contextually rich. Without it, even the most sophisticated AI produces unreliable or misleading outputs. It’s the classic rule: garbage in, garbage out.
In short, data engineering isn't a back-office function anymore; it's the foundation of trustworthy AI. Getting it right is the first step in building any intelligent system that works.
If GenAI is the engine, data engineering is the fuel system. It's not just about having data; it's about turning fragmented, messy data into something that AI can understand and learn from.
In practice, data engineering enables key foundations of any system:
For GenAI specifically, engineering pipelines are essential for:
Done right, data engineering makes GenAI more accurate, grounded, and ready for real-world use cases. It’s the invisible infrastructure that transforms AI from a cool demo into a business-ready product.
At the core of Databricks is the Lakehouse architecture, introduced by Databricks in 2021, which combines the best of data lakes (flexibility, scalability) and data warehouses (structure, performance). This means you can store vast amounts of raw data and also run high-performance analytics or machine learning on top of it, without duplicating systems.
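If you haven't seen the lakehouse pattern in action, here is a minimal sketch in PySpark (the default language of Databricks notebooks, where spark is predefined). The bucket path and table name are placeholders, not part of the original setup:

# Minimal lakehouse sketch: land raw files as a Delta table, then query
# the same copy with SQL -- no separate warehouse load. The S3 path and
# table name below are hypothetical placeholders.
raw_df = (
    spark.read
    .option("header", "true")
    .csv("s3://your-bucket/raw-data/")
)

# Persist as a governed Delta table (the lakehouse storage layer)
raw_df.write.format("delta").mode("overwrite").saveAsTable("raw_events")

# The very same table now serves BI-style SQL and ML feature prep
spark.sql("SELECT COUNT(*) AS row_count FROM raw_events").show()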
What makes Databricks especially powerful for GenAI and modern applications:
When it comes to building real-world GenAI applications, you need more than just a model — you need a stack: data pipelines, storage, compute, vector search, model serving, governance, and monitoring. This is where Databricks on AWS shines.
Databricks provides an end-to-end environment for building GenAI solutions, fully integrated with AWS services. It gives you all the tools to go from raw data to AI-powered apps, without needing to cobble together multiple platforms.
This tight integration means you don’t have to jump between systems or manage fragile handoffs between teams. You can build, test, deploy, and monitor GenAI apps, from prototype to production, entirely within Databricks on AWS.
If I had to sum up Databricks in one sentence: it is a unified platform for running data, analytics, and AI workloads.
Ok, now it is time to put everything we saw into action. This is a big section, so I’ve compartmentalized it into 5a, 5b, 5c, 5d, and 5e.
2. Enter your AWS credentials with which you want to set up Databricks
3. You will be redirected to AWS Marketplace, where you can start the free 14-day trial. You will be directed back and forth between your AWS account and the new Databricks account you are setting up; just follow the instructions. You will now have a Lakehouse set up as shown in the screenshot.
4. Use AWS QuickStart to create a new Databricks Workspace
5. Click Open on the newly created workspace to launch it.
So far, you have set up your AWS account and Databricks account.
Databricks and AWS CLI (optional)
These steps are completely optional; you can skip them if you prefer the UI route.
Setting up Databricks CLI:
databricks auth login
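If you want a quick sanity check that the CLI login worked, here is a minimal sketch using the Databricks Python SDK; it assumes the databricks-sdk package is installed and picks up the profile the CLI just wrote:

# Verify the credentials created by databricks auth login are usable from Python
# Assumes: pip install databricks-sdk
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()      # reads the profile from ~/.databrickscfg by default
me = w.current_user.me()   # simple authenticated API call
print("Authenticated to Databricks as:", me.user_name)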
Setting Up AWS CLI:
If Not Configured
Run aws configure and enter the following when prompted:
AWS Access Key ID
AWS Secret Access Key
Default region name (e.g. us-east-1)
Default output format (e.g. json)
🗒️ You’re ready to use the AWS CLI with services like S3, EC2, and Lambda, and also with the Databricks CLI when it needs AWS credentials (e.g. if you’re uploading data to S3 from Databricks).
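As a quick check that the AWS side is wired up too, here is a small sketch using boto3 (an assumption: boto3 installed via pip, reusing the credentials written by aws configure):

# Sanity check that AWS credentials are picked up correctly
# Assumes: pip install boto3
import boto3

identity = boto3.client("sts").get_caller_identity()
print("Authenticated to AWS account:", identity["Account"])

# List buckets to confirm S3 access (the demo bucket will appear once created)
for bucket in boto3.client("s3").list_buckets()["Buckets"]:
    print(" -", bucket["Name"])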
This section focuses on preparing the necessary data and the workspace structure for the project.
Steps for this tutorial
2. In my sample use case, I have 1 structured dataset (a CSV of support tickets) and 3 unstructured support documents.
Upload Sample data to S3 via console or AWS CLI:
Structured Tickets: support_tickets.csv
aws s3 cp support_tickets.csv s3://my-support-demo-bucket/source-data/
Unstructured Support Docs: a support_documents folder containing billing_faq.txt, technical_faq.txt, and product_guide.txt
aws s3 cp support_documents/ s3://my-support-demo-bucket/support-docs/ --recursive
3. Workspace setup in Databricks
Create a folder called genai-support-pipeline and store all your notebooks in it.
Workspace/
└── genai-support-pipeline/
├── 01_Load_and_Clean_Tickets.py
├── 02_Process_Documents.py
└── README.md
Recap:
For our use case, we need basic compute + Delta + ML runtime support — no need for GPU or big clusters.
For this GenAI tutorial, I created a lightweight, cost-effective single-node Databricks cluster using the 16.4 LTS ML runtime. I chose the r5dn.large instance type (16 GB RAM, 2 vCPUs), which is perfect for ETL, vector indexing, and small LLM tasks.
To control costs, I set the cluster to auto-terminate after 30 minutes of inactivity. This setup offers just the right balance of power and affordability for a hands-on project like this.
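If you prefer scripting the cluster over clicking through the UI, here is a rough sketch using the Databricks Python SDK. The spark_version string and tags are assumptions based on my setup; list the exact runtime identifiers available in your workspace before using it:

# Sketch: create a single-node 16.4 LTS ML cluster like the one described above.
# Assumes databricks-sdk is installed and authenticated; the spark_version
# identifier is an assumption (check w.clusters.spark_versions()).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

cluster = w.clusters.create_and_wait(
    cluster_name="genai-support-pipeline",
    spark_version="16.4.x-cpu-ml-scala2.12",   # 16.4 LTS ML runtime (assumed identifier)
    node_type_id="r5dn.large",                 # 16 GB RAM, 2 vCPUs
    num_workers=0,                             # single-node: driver only
    autotermination_minutes=30,                # auto-terminate to control costs
    spark_conf={
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    custom_tags={"ResourceClass": "SingleNode"},
)
print("Cluster ID:", cluster.cluster_id)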
A note on Cluster Costs:
4-step process (that took me over 2 hours to figure out):
Databricks Cheatsheet
Before you run the notebooks, here is a glossary table to get you oriented with some fields.
3. Start running through each notebook to understand various processes
4. Here is the notebook breakdown structure:
📓 Notebook 1: 01_Load_and_Clean_Tickets.py
📘 Tables Created:
💡 Goal:
Structure raw support ticket data for easy analysis, reporting, or ML.
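To make the flow concrete, here is a rough sketch of what this notebook does. Column handling and the output table name are simplified assumptions; the real notebook is in the GitHub repo linked at the end:

# 01_Load_and_Clean_Tickets.py (illustrative sketch only)
from pyspark.sql import functions as F

raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-support-demo-bucket/source-data/support_tickets.csv")
)

clean = (
    raw.dropDuplicates()
       .na.drop(how="all")                                # drop fully empty rows
       .withColumn("ingested_at", F.current_timestamp())  # simple audit column
)

# Persist as a Delta table for downstream analysis, reporting, or ML
clean.write.format("delta").mode("overwrite").saveAsTable("support_tickets_clean")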
📓 Notebook 2: 02_Process_Documents.py
📘 Tables Created:
💡 Goal:
Prepare an unstructured knowledge base (for semantic search and RAG).
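And a similarly simplified sketch for this notebook, with an assumed output table name:

# 02_Process_Documents.py (illustrative sketch only)
# Read the raw .txt support docs and land them in a Delta table that will
# later back the RAG knowledge base.
from pyspark.sql import functions as F

docs = (
    spark.read
    .format("text")
    .option("wholetext", "true")                 # one row per file
    .load("s3://my-support-demo-bucket/support-docs/*.txt")
    .withColumn("source", F.input_file_name())   # keep the originating file path
    .withColumnRenamed("value", "content")
)

docs.write.format("delta").mode("overwrite").saveAsTable("support_documents_raw")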
OPTIONAL NEXT STEPS: 📓 Notebook 3: Setting up a Simple RAG Workflow
As a preparation step for RAG, the documents comprising the knowledge base will be split into chunks and stored in Delta tables. Each chunk will be embedded using a Databricks-hosted embedding model to generate dense vector representations.
These embeddings, along with metadata such as id, source, and content, will be stored in a Delta table. A Vector Search Index is then created in Databricks to enable fast semantic retrieval of the most relevant chunks based on user queries.
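Here is a minimal sketch of that preparation step, building on the support_documents_raw table from the previous sketch. The chunk size and table names are assumptions, and Change Data Feed is enabled so a Vector Search index can sync from the table:

# Split documents into chunks and store them with id/source/content metadata
from pyspark.sql import functions as F, types as T

CHUNK_SIZE = 1000  # characters per chunk (assumed)

@F.udf(T.ArrayType(T.StringType()))
def split_into_chunks(text):
    text = text or ""
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

chunks = (
    spark.table("support_documents_raw")
         .withColumn("content", F.explode(split_into_chunks("content")))
         .withColumn("id", F.monotonically_increasing_id())
         .select("id", "source", "content")
)

chunks.write.format("delta").mode("overwrite").saveAsTable("support_docs_chunks")

# Required so a Delta Sync vector index can track changes to this table
spark.sql("ALTER TABLE support_docs_chunks SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")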
🔍 3 Steps to Set Up Vector Search in Databricks (see the sketch after this list):
1. Create a Vector Search Endpoint
2. Serve an Embedding Model
3. Create a Vector Search Index
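Below is a rough sketch of those three steps using the databricks-vectorsearch Python client. The endpoint name, fully qualified table/index names, and the embedding model endpoint are assumptions; adjust them to your workspace:

# Vector search setup sketch; all names are illustrative
# Assumes: pip install databricks-vectorsearch
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# 1. Create a Vector Search Endpoint
client.create_endpoint(name="support-vs-endpoint", endpoint_type="STANDARD")

# 2. Serve an Embedding Model
#    Here we reuse a Databricks-hosted embedding endpoint (e.g. databricks-gte-large-en)
#    instead of deploying our own model.

# 3. Create a Vector Search Index that syncs from the chunked Delta table
index = client.create_delta_sync_index(
    endpoint_name="support-vs-endpoint",
    index_name="main.default.support_docs_index",          # catalog.schema.name (assumed)
    source_table_name="main.default.support_docs_chunks",  # the chunk table from earlier
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="content",                     # Databricks computes the embeddings
    embedding_model_endpoint_name="databricks-gte-large-en",
)

# Once the initial sync completes, retrieve the most relevant chunks for a question
results = index.similarity_search(
    query_text="How do I update my billing information?",
    columns=["id", "source", "content"],
    num_results=3,
)
print(results)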
Note: You can find notebooks 1 and 2, along with the sample docs, in my GitHub repo (https://github.com/lulu3202/databricks_s3_starter/tree/main)
As the lines between data engineering and AI continue to blur, the ability to work across both domains is becoming a defining skill of the modern tech professional. Whether you’re preparing data pipelines, fine-tuning models, or deploying GenAI apps, understanding how data and AI connect is what sets impactful solutions apart from flashy prototypes. Platforms like Databricks — especially when paired with the scale and flexibility of AWS — make it easier than ever to experiment, build, and deploy end-to-end AI systems.
This is why the most in-demand tech professionals today — and tomorrow — are those who can bridge both domains:
Platforms like Databricks make this convergence accessible. Whether you’re starting with data engineering or exploring LLMs for the first time, Databricks gives you a space to experiment, learn, and build — all in one place, and all on top of trusted infrastructure like AWS.