08-23-2025 04:09 AM
Recently I earned the Databricks Machine Learning Professional certification and wanted to share my study journey. Before the exam, I worked on a project as a data engineer alongside data scientists (ML models, LLMs, MLflow). That led me to build a personal RAG project on Databricks, which ended up preparing me for many exam topics. Below is a compact flow of that project plus the official prep guide and a Udemy practice test.
My Databricks RAG Lab - flow summary
1) Goal & stack
- Goal: practice end-to-end (ingest -> embed -> retrieve -> answer) on Databricks
- Stack: Databricks Runtime; PyMuPDF (PDF parsing); Hugging Face E5 via Databricks Model Serving; PostgreSQL + pgvector (vector store); LangChain (RAG); secrets with dbutils.secrets; Jobs/Workflows (orchestration)
2) Ingestion
- Upload PDFs to DBFS (/tmp/.../docs/) with UUID filenames
- Keep upload separated from processing for scale and observability
3) Orchestration
- Databricks Workflow scans the folder and triggers the processor notebook with the file path as a parameter
- Simple, re-runnable pipeline
4) Parsing & chunking
- Extract text with PyMuPDF
- Chunk to fit embedding/token limits and carry metadata (file, user, timestamps)
5) Embeddings (Model Serving)
- Call a Model Serving endpoint with E5 to generate embeddings
- Decouple model choice so you can swap models without rewriting the pipeline
6) Vector storage
- Store chunks + vectors + metadata in PostgreSQL/pgvector
- Use SQL for Top-K similarity; easy to debug and cost-predictable
7) Retrieval (Top-K)
- Embed the question -> run Top-K vector search in pgvector -> fetch relevant chunks
😎 Generation (RAG)
- Build a prompt with question + retrieved chunks and call an LLM endpoint for grounded answers
9) Ops, security, observability
- Secrets via secret scopes (DB creds, endpoints)
- Layout ready for multi-tenant isolation if needed
- Simple metrics (latency, Top-K size, document counts); optional MLflow versioning
10) Why this helped for the exam
- Exercises Jobs/Workflows and Model Serving (orchestration and deployment)
- Hands-on with feature engineering/embeddings and modular pipelines
- MLOps basics: reproducibility, secrets, cost/performance trade-offs
- Practice discussing governance and best practices aligned to exam objectives
Official prep guide:
https://www.databricks.com/learn/certification/machine-learning-professional
Udemy practice test:
https://www.udemy.com/course/databricks-machine-learning-professional-practice-test/
I wish you all the best!
08-23-2025 04:30 AM
08-23-2025 06:21 AM
Thanks a bunch for sharing this @WiliamRosa. I've bookmarked this and I'll be using this as my reference guide when I get deeper into ML later in the year 🤞😏. That project looks so freaking cool by the way!! Bravo, sir👏.
All the best.
BS
08-23-2025 11:08 AM
Thanks a lot, my friend @BS_THE_ANALYST ! Really glad you found it useful 🙌. I’m sure when you dive into ML later this year, you’ll do awesome things with it. Appreciate the kind words about the project — means a lot! 🚀
All the best to you too, and let’s keep learning and sharing along the way 💪.
08-23-2025 04:14 AM
Hey @WiliamRosa , this is a super cool write up, thanks for sharing!
Just curious: did you face any unexpected challenges on this project?
08-23-2025 04:24 AM
Hi @TheOC , thanks for the kind words. I did run into a bit of difficulty choosing the vector store—I went with Postgres/pgvector as a middle ground for response time and volume, with a path to scale later to Aurora. For the exam, they expect familiarity with using Delta tables. Also, there are other open-source vector databases to consider, and the choice should depend on each project’s context. Hope that helps!
08-23-2025 04:30 AM
08-23-2025 06:21 AM
Thanks a bunch for sharing this @WiliamRosa. I've bookmarked this and I'll be using this as my reference guide when I get deeper into ML later in the year 🤞😏. That project looks so freaking cool by the way!! Bravo, sir👏.
All the best.
BS
08-23-2025 11:08 AM
Thanks a lot, my friend @BS_THE_ANALYST ! Really glad you found it useful 🙌. I’m sure when you dive into ML later this year, you’ll do awesome things with it. Appreciate the kind words about the project — means a lot! 🚀
All the best to you too, and let’s keep learning and sharing along the way 💪.
Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!
Sign Up Now