Databricks Machine Learning Professional Preparation

WiliamRosa
New Contributor III

Recently I earned the Databricks Machine Learning Professional certification and wanted to share my study journey. Before the exam, I worked as a data engineer on a project alongside data scientists (ML models, LLMs, MLflow). That led me to build a personal RAG project on Databricks, which ended up preparing me for many of the exam topics. Below is a compact walkthrough of that project, plus the official prep guide and a Udemy practice test.

My Databricks RAG Lab - flow summary 

1) Goal & stack 

- Goal: practice an end-to-end RAG flow (ingest -> embed -> retrieve -> answer) on Databricks 

- Stack: Databricks Runtime; PyMuPDF (PDF parsing); Hugging Face E5 via Databricks Model Serving; PostgreSQL + pgvector (vector store); LangChain (RAG); secrets with dbutils.secrets; Jobs/Workflows (orchestration) 

2) Ingestion 

- Upload PDFs to DBFS (/tmp/.../docs/) with UUID filenames 

- Keep the upload step separate from processing, for scalability and observability 
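
To make this concrete, here is a minimal sketch of the upload step; the folder and file names are illustrative, not the exact ones from my lab:

```python
# Minimal sketch: copy an uploaded PDF into the landing folder under a UUID
# name so later steps can key everything off that ID. Paths are placeholders;
# dbutils is only available inside Databricks notebooks.
import uuid

landing_dir = "dbfs:/tmp/rag_lab/docs/"              # assumed landing folder
source_path = "dbfs:/FileStore/uploads/report.pdf"   # hypothetical uploaded file

doc_id = str(uuid.uuid4())
target_path = f"{landing_dir}{doc_id}.pdf"

dbutils.fs.cp(source_path, target_path)
print(f"Registered document {doc_id} at {target_path}")
```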

3) Orchestration 

- Databricks Workflow scans the folder and triggers the processor notebook with the file path as a parameter 

- Simple, re-runnable pipeline 
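
A rough sketch of that handshake, assuming a processor notebook named "processor" and a "file_path" parameter (both names are just placeholders):

```python
# A) Scanner task: list PDFs in the landing folder and hand each one to the
#    processor notebook. Notebook name and paths are assumptions.
for info in dbutils.fs.ls("dbfs:/tmp/rag_lab/docs/"):
    if info.path.endswith(".pdf"):
        dbutils.notebook.run("processor", 600, {"file_path": info.path})

# B) First cell of the processor notebook: read the parameter passed in by the
#    Workflow/scanner task.
dbutils.widgets.text("file_path", "")
file_path = dbutils.widgets.get("file_path")
if not file_path:
    raise ValueError("file_path parameter is required")
```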

4) Parsing & chunking 

- Extract text with PyMuPDF 

- Chunk to fit embedding/token limits and carry metadata (file, user, timestamps) 
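
A minimal sketch of this step with PyMuPDF; the chunk size and overlap are illustrative, not tuned values:

```python
# Minimal sketch: extract text with PyMuPDF and split it into overlapping
# chunks that stay within the embedding model's input limits.
import fitz  # PyMuPDF

def extract_text(pdf_path: str) -> str:
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# DBFS paths are also visible under /dbfs on the driver's local filesystem;
# file_path comes from the orchestration sketch above.
local_path = file_path.replace("dbfs:", "/dbfs")
text = extract_text(local_path)
chunks = chunk_text(text)
```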

5) Embeddings (Model Serving) 

- Call a Model Serving endpoint with E5 to generate embeddings 

- Decouple model choice so you can swap models without rewriting the pipeline 
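
A minimal sketch of calling the endpoint; the endpoint name and the request/response shape depend on how the model was logged, so treat both as assumptions:

```python
# Minimal sketch: call a Databricks Model Serving endpoint to embed the chunks.
# Endpoint name, secret scope/key and payload format are assumptions.
import requests

host = "https://<your-workspace-url>"                        # placeholder
token = dbutils.secrets.get("rag_lab", "databricks_token")   # assumed secret scope/key

def embed(texts: list[str], endpoint: str = "e5-embeddings") -> list[list[float]]:
    resp = requests.post(
        f"{host}/serving-endpoints/{endpoint}/invocations",
        headers={"Authorization": f"Bearer {token}"},
        json={"inputs": texts},
    )
    resp.raise_for_status()
    return resp.json()["predictions"]

vectors = embed(chunks)
```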

6) Vector storage 

- Store chunks + vectors + metadata in PostgreSQL/pgvector 

- Use SQL for Top-K similarity; easy to debug and cost-predictable 
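
A minimal sketch of the pgvector side; the table layout, credentials, and embedding dimension are assumptions, and it reuses the chunks and vectors from the previous steps:

```python
# Minimal sketch: persist chunks, embeddings and metadata in PostgreSQL/pgvector.
# Assumes the pgvector extension is installed and psycopg2 is available.
import json
import psycopg2

conn = psycopg2.connect(
    host=dbutils.secrets.get("rag_lab", "pg_host"),
    dbname="rag",
    user=dbutils.secrets.get("rag_lab", "pg_user"),
    password=dbutils.secrets.get("rag_lab", "pg_password"),
)

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS doc_chunks (
            id        SERIAL PRIMARY KEY,
            doc_id    TEXT,
            chunk     TEXT,
            embedding vector(1024),   -- 1024 matches E5-large; adjust to your model
            metadata  JSONB
        )
    """)
    for chunk, vec in zip(chunks, vectors):
        cur.execute(
            "INSERT INTO doc_chunks (doc_id, chunk, embedding, metadata) "
            "VALUES (%s, %s, %s::vector, %s)",
            (doc_id, chunk, "[" + ",".join(map(str, vec)) + "]",
             json.dumps({"source": file_path})),
        )
```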

7) Retrieval (Top-K) 

- Embed the question -> run Top-K vector search in pgvector -> fetch relevant chunks 
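
A minimal sketch of retrieval, reusing the connection and embed helper from the sketches above; pgvector's <=> operator gives cosine distance:

```python
# Minimal sketch: embed the question and run a Top-K similarity search in
# pgvector. K and the example question are illustrative.
def retrieve(question: str, k: int = 5) -> list[str]:
    qvec = embed([question])[0]
    qvec_literal = "[" + ",".join(map(str, qvec)) + "]"
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT chunk FROM doc_chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (qvec_literal, k),
        )
        return [row[0] for row in cur.fetchall()]

context_chunks = retrieve("What does the document say about onboarding?")
```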

8) Generation (RAG) 

- Build a prompt with question + retrieved chunks and call an LLM endpoint for grounded answers 
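
In the lab this glue was done with LangChain; below is a bare-bones sketch without it, assuming a chat-style LLM serving endpoint (the endpoint name and payload/response shape are assumptions):

```python
# Minimal sketch: assemble a grounded prompt from the retrieved chunks and call
# an LLM serving endpoint. Reuses host, token and requests from the embedding
# sketch; the chat-style payload and response parsing are assumptions.
def answer(question: str, context_chunks: list[str],
           llm_endpoint: str = "rag-llm") -> str:
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(
        f"{host}/serving-endpoints/{llm_endpoint}/invocations",
        headers={"Authorization": f"Bearer {token}"},
        json={"messages": [{"role": "user", "content": prompt}]},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(answer("What does the document say about onboarding?", context_chunks))
```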

9) Ops, security, observability 

- Secrets via secret scopes (DB creds, endpoints) 

- Layout ready for multi-tenant isolation if needed 

- Simple metrics (latency, Top-K size, document counts); optional MLflow versioning 
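
A minimal sketch of the optional metrics piece with MLflow; the metric names are illustrative:

```python
# Minimal sketch: log simple pipeline metrics with MLflow, reusing the retrieve
# helper and chunks from the sketches above.
import time
import mlflow

with mlflow.start_run(run_name="rag_pipeline"):
    start = time.time()
    retrieved = retrieve("example question")
    mlflow.log_metric("retrieval_latency_seconds", time.time() - start)
    mlflow.log_metric("top_k", len(retrieved))
    mlflow.log_metric("document_count", len(chunks))
```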

10) Why this helped for the exam 

- Exercises Jobs/Workflows and Model Serving (orchestration and deployment) 

- Hands-on with feature engineering/embeddings and modular pipelines 

- MLOps basics: reproducibility, secrets, cost/performance trade-offs 

- Practice discussing governance and best practices aligned to exam objectives 

Official prep guide: 
https://www.databricks.com/learn/certification/machine-learning-professional 
Udemy practice test: 
https://www.udemy.com/course/databricks-machine-learning-professional-practice-test/ 

I wish you all the best!   

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

5 REPLIES

TheOC
Contributor III

Hey @WiliamRosa, this is a super cool write-up, thanks for sharing! 
Just curious: did you face any unexpected challenges on this project?

Cheers,
TheOC

WiliamRosa
New Contributor III

Hi @TheOC, thanks for the kind words. I did run into a bit of difficulty choosing the vector store: I went with Postgres/pgvector as a middle ground for response time and volume, with a path to scale later to Aurora. For the exam, though, they expect familiarity with Delta tables. There are also other open-source vector databases worth considering, and the choice should depend on each project's context. Hope that helps!

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

@WiliamRosa it certainly does help! 
Thanks for the insight.

Cheers,
TheOC

BS_THE_ANALYST
Esteemed Contributor

Thanks a bunch for sharing this @WiliamRosa. I've bookmarked this and I'll be using this as my reference guide when I get deeper into ML later in the year 🤞😏. That project looks so freaking cool by the way!! Bravo, sir👏

All the best.
BS

WiliamRosa
New Contributor III

Thanks a lot, my friend @BS_THE_ANALYST ! Really glad you found it useful 🙌. I’m sure when you dive into ML later this year, you’ll do awesome things with it. Appreciate the kind words about the project — means a lot! 🚀

All the best to you too, and let’s keep learning and sharing along the way 💪.

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa