Databricks Machine Learning Professional Preparation

WiliamRosa
New Contributor III

Recently I earned the Databricks Machine Learning Professional certification and wanted to share my study journey. Before the exam, I worked as a data engineer on a project alongside data scientists (ML models, LLMs, MLflow). That led me to build a personal RAG project on Databricks, which ended up preparing me for many of the exam topics. Below is a compact walkthrough of that project, plus the official prep guide and a Udemy practice test.

My Databricks RAG Lab - flow summary 

1) Goal & stack 

- Goal: practice an end-to-end RAG flow (ingest -> embed -> retrieve -> answer) on Databricks 

- Stack: Databricks Runtime; PyMuPDF (PDF parsing); Hugging Face E5 via Databricks Model Serving; PostgreSQL + pgvector (vector store); LangChain (RAG); secrets with dbutils.secrets; Jobs/Workflows (orchestration) 

2) Ingestion 

- Upload PDFs to DBFS (/tmp/.../docs/) with UUID filenames 

- Keep the upload step separate from processing, for scalability and observability 
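
To make this concrete, here is a minimal sketch of the upload step; the folder and file names are illustrative, not the exact ones from my lab:

```python
# Minimal sketch: copy an uploaded PDF into the landing folder under a UUID
# name so later steps can key everything off that ID. Paths are placeholders;
# dbutils is only available inside Databricks notebooks.
import uuid

landing_dir = "dbfs:/tmp/rag_lab/docs/"              # assumed landing folder
source_path = "dbfs:/FileStore/uploads/report.pdf"   # hypothetical uploaded file

doc_id = str(uuid.uuid4())
target_path = f"{landing_dir}{doc_id}.pdf"

dbutils.fs.cp(source_path, target_path)
print(f"Registered document {doc_id} at {target_path}")
```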

3) Orchestration 

- Databricks Workflow scans the folder and triggers the processor notebook with the file path as a parameter 

- Simple, re-runnable pipeline 
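
A rough sketch of that handshake, assuming a processor notebook named "processor" and a "file_path" parameter (both names are just placeholders):

```python
# A) Scanner task: list PDFs in the landing folder and hand each one to the
#    processor notebook. Notebook name and paths are assumptions.
for info in dbutils.fs.ls("dbfs:/tmp/rag_lab/docs/"):
    if info.path.endswith(".pdf"):
        dbutils.notebook.run("processor", 600, {"file_path": info.path})

# B) First cell of the processor notebook: read the parameter passed in by the
#    Workflow/scanner task.
dbutils.widgets.text("file_path", "")
file_path = dbutils.widgets.get("file_path")
if not file_path:
    raise ValueError("file_path parameter is required")
```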

4) Parsing & chunking 

- Extract text with PyMuPDF 

- Chunk to fit embedding/token limits and carry metadata (file, user, timestamps) 
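
A minimal sketch of this step with PyMuPDF; the chunk size and overlap are illustrative, not tuned values:

```python
# Minimal sketch: extract text with PyMuPDF and split it into overlapping
# chunks that stay within the embedding model's input limits.
import fitz  # PyMuPDF

def extract_text(pdf_path: str) -> str:
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# DBFS paths are also visible under /dbfs on the driver's local filesystem;
# file_path comes from the orchestration sketch above.
local_path = file_path.replace("dbfs:", "/dbfs")
text = extract_text(local_path)
chunks = chunk_text(text)
```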

5) Embeddings (Model Serving) 

- Call a Model Serving endpoint with E5 to generate embeddings 

- Decouple model choice so you can swap models without rewriting the pipeline 
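
A minimal sketch of calling the endpoint; the endpoint name and the request/response shape depend on how the model was logged, so treat both as assumptions:

```python
# Minimal sketch: call a Databricks Model Serving endpoint to embed the chunks.
# Endpoint name, secret scope/key and payload format are assumptions.
import requests

host = "https://<your-workspace-url>"                        # placeholder
token = dbutils.secrets.get("rag_lab", "databricks_token")   # assumed secret scope/key

def embed(texts: list[str], endpoint: str = "e5-embeddings") -> list[list[float]]:
    resp = requests.post(
        f"{host}/serving-endpoints/{endpoint}/invocations",
        headers={"Authorization": f"Bearer {token}"},
        json={"inputs": texts},
    )
    resp.raise_for_status()
    return resp.json()["predictions"]

vectors = embed(chunks)
```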

6) Vector storage 

- Store chunks + vectors + metadata in PostgreSQL/pgvector 

- Use SQL for Top-K similarity; easy to debug and cost-predictable 
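
A minimal sketch of the pgvector side; the table layout, credentials, and embedding dimension are assumptions, and it reuses the chunks and vectors from the previous steps:

```python
# Minimal sketch: persist chunks, embeddings and metadata in PostgreSQL/pgvector.
# Assumes the pgvector extension is installed and psycopg2 is available.
import json
import psycopg2

conn = psycopg2.connect(
    host=dbutils.secrets.get("rag_lab", "pg_host"),
    dbname="rag",
    user=dbutils.secrets.get("rag_lab", "pg_user"),
    password=dbutils.secrets.get("rag_lab", "pg_password"),
)

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS doc_chunks (
            id        SERIAL PRIMARY KEY,
            doc_id    TEXT,
            chunk     TEXT,
            embedding vector(1024),   -- 1024 matches E5-large; adjust to your model
            metadata  JSONB
        )
    """)
    for chunk, vec in zip(chunks, vectors):
        cur.execute(
            "INSERT INTO doc_chunks (doc_id, chunk, embedding, metadata) "
            "VALUES (%s, %s, %s::vector, %s)",
            (doc_id, chunk, "[" + ",".join(map(str, vec)) + "]",
             json.dumps({"source": file_path})),
        )
```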

7) Retrieval (Top-K) 

- Embed the question -> run Top-K vector search in pgvector -> fetch relevant chunks 
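
A minimal sketch of retrieval, reusing the connection and embed helper from the sketches above; pgvector's <=> operator gives cosine distance:

```python
# Minimal sketch: embed the question and run a Top-K similarity search in
# pgvector. K and the example question are illustrative.
def retrieve(question: str, k: int = 5) -> list[str]:
    qvec = embed([question])[0]
    qvec_literal = "[" + ",".join(map(str, qvec)) + "]"
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT chunk FROM doc_chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (qvec_literal, k),
        )
        return [row[0] for row in cur.fetchall()]

context_chunks = retrieve("What does the document say about onboarding?")
```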

8) Generation (RAG) 

- Build a prompt with question + retrieved chunks and call an LLM endpoint for grounded answers 
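
In the lab this glue was done with LangChain; below is a bare-bones sketch without it, assuming a chat-style LLM serving endpoint (the endpoint name and payload/response shape are assumptions):

```python
# Minimal sketch: assemble a grounded prompt from the retrieved chunks and call
# an LLM serving endpoint. Reuses host, token and requests from the embedding
# sketch; the chat-style payload and response parsing are assumptions.
def answer(question: str, context_chunks: list[str],
           llm_endpoint: str = "rag-llm") -> str:
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(
        f"{host}/serving-endpoints/{llm_endpoint}/invocations",
        headers={"Authorization": f"Bearer {token}"},
        json={"messages": [{"role": "user", "content": prompt}]},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(answer("What does the document say about onboarding?", context_chunks))
```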

9) Ops, security, observability 

- Secrets via secret scopes (DB creds, endpoints) 

- Layout ready for multi-tenant isolation if needed 

- Simple metrics (latency, Top-K size, document counts); optional MLflow versioning 
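
A minimal sketch of the optional metrics piece with MLflow; the metric names are illustrative:

```python
# Minimal sketch: log simple pipeline metrics with MLflow, reusing the retrieve
# helper and chunks from the sketches above.
import time
import mlflow

with mlflow.start_run(run_name="rag_pipeline"):
    start = time.time()
    retrieved = retrieve("example question")
    mlflow.log_metric("retrieval_latency_seconds", time.time() - start)
    mlflow.log_metric("top_k", len(retrieved))
    mlflow.log_metric("document_count", len(chunks))
```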

10) Why this helped for the exam 

- Exercises Jobs/Workflows and Model Serving (orchestration and deployment) 

- Hands-on with feature engineering/embeddings and modular pipelines 

- MLOps basics: reproducibility, secrets, cost/performance trade-offs 

- Practice discussing governance and best practices aligned to exam objectives 

Official prep guide: 
https://www.databricks.com/learn/certification/machine-learning-professional 
Udemy practice test: 
https://www.udemy.com/course/databricks-machine-learning-professional-practice-test/ 

I wish you all the best!   

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

5 REPLIES

TheOC
Contributor III

Hey @WiliamRosa, this is a super cool write-up, thanks for sharing! 
Just curious: did you face any unexpected challenges on this project?

Cheers,
TheOC

WiliamRosa
New Contributor III

Hi @TheOC, thanks for the kind words. I did run into a bit of difficulty choosing the vector store: I went with Postgres/pgvector as a middle ground for response time and volume, with a path to scale later to Aurora. For the exam, though, they expect familiarity with Delta tables. There are also other open-source vector databases worth considering, and the choice should depend on each project's context. Hope that helps!

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

@WiliamRosa it certainly does help! 
Thanks for the insight.

Cheers,
TheOC

BS_THE_ANALYST
Esteemed Contributor

Thanks a bunch for sharing this @WiliamRosa. I've bookmarked this and I'll be using this as my reference guide when I get deeper into ML later in the year 🤞😏. That project looks so freaking cool by the way!! Bravo, sir👏

All the best.
BS

WiliamRosa
New Contributor III

Thanks a lot, my friend @BS_THE_ANALYST ! Really glad you found it useful 🙌. I’m sure when you dive into ML later this year, you’ll do awesome things with it. Appreciate the kind words about the project — means a lot! 🚀

All the best to you too, and let’s keep learning and sharing along the way 💪.

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa