Hi everyone,
For the Databricks Free Edition Hackathon, I wanted to show that traditional ML still has a big role today, and how it can work hand-in-hand with Databricks' newer AI tooling. As a concrete use case, I created a recipe recommendation engine that suggests relevant recipes to users, using classic NLP and topic modelling to structure the data, then AI/BI Genie to bring that value out for end users. The two work together rather than instead of each other. I've always had an interest in NLP tools for analysing classical Arabic texts, but I'd never built an end-to-end solution in Databricks that could bring an NLP solution to life, so I thought this was a great chance to demonstrate it.
What I built
I built a recipe recommendation engine with the following components:
- Lakeflow Declarative Pipelines (LDP) to ingest and prepare text data using a medallion architecture.
- PySpark ML with an LDA topic model to discover themes in recipes, built as a job in Databricks.
- AI/BI Genie on top to explore and get recommendations in natural language
All of this runs on Databricks Free Edition.
Data & preparation (LDP + tokenising)
The starting point was a Kaggle recipes dataset with titles, descriptions and ingredients.
Using LDP, I set up a simple pipeline (sketched in code after this list) to:
- Bronze - Ingest the raw data
- Silver - Clean obvious issues (duplicates, missing key fields)
- Silver - Focus on the crucial NLP step: tokenising the text
- Gold - Aggregates, including downstream tables with each recipe's title and its associated words (tokens)
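To make the layers concrete, here is a minimal sketch of the bronze and silver steps using the dlt Python API; the source path, file format and column names are assumptions, not the exact ones from my pipeline:

```python
import dlt

# Hypothetical source location -- adjust to the Kaggle dataset layout.
RAW_PATH = "/Volumes/recipes/raw/"

@dlt.table(comment="Bronze: raw recipe CSVs ingested as-is")
def recipes_bronze():
    # `spark` is provided by the pipeline runtime.
    return spark.read.option("header", "true").csv(RAW_PATH)

@dlt.table(comment="Silver: duplicates removed, rows missing key fields dropped")
def recipes_silver():
    return (
        dlt.read("recipes_bronze")
        .dropDuplicates(["title"])
        .dropna(subset=["title", "ingredients", "description"])
    )
```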
By tokenising, I mean (a PySpark sketch follows this list):
- Breaking text into meaningful words
- Example: "Spicy tomato pasta with fresh basil"
["spicy", "tomato", "pasta", "fresh", "basil"]
- Removing noise (stop words)
- Stripping out filler words like "and", "with" and "the". These are called stop words, and they don't really bring value to classification.
- Normalising similar word forms
- Treating "cook", "cooks", "cooked", "cooking" more like the same underlying word, so the model focuses on the concept rather than inflected forms.
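In PySpark ML terms, those three steps look roughly like this. It's a sketch: the recipes_silver table, the description column and the extra stop words are illustrative assumptions:

```python
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

recipes_silver = spark.table("recipes_silver")

# Break text into lowercase word tokens, splitting on non-word characters.
tokenizer = RegexTokenizer(inputCol="description", outputCol="raw_tokens",
                           pattern="\\W+", toLowercase=True)

# Default English stop words, extended with corpus-specific noise words.
extra_stop_words = ["recipe", "cup", "tablespoon"]  # illustrative only
remover = StopWordsRemover(
    inputCol="raw_tokens", outputCol="tokens",
    stopWords=StopWordsRemover.loadDefaultStopWords("english") + extra_stop_words,
)

tokens_df = remover.transform(tokenizer.transform(recipes_silver))

# Normalising word forms (cook/cooks/cooking) would typically be a stemming or
# lemmatising UDF (e.g. wrapping an NLTK stemmer) over the tokens column.
```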
After building the initial pipeline, I continuously monitored and iterated on the data rather than assuming it was "done":
- I used Genie to ask questions like "Which very common words bring little or no value to the corpus?" to surface extra stop words that weren't helpful for modelling.
- I also generated a word cloud image over the tokens and discovered that some values were actually encoded fractions like u00bd (½) leaking into the text. That fed back into the cleaning logic in my LDP (see the sketch below) so these artefacts were stripped out.
This showed that working with text data is rarely straightforward: you often need an iterative loop of inspecting the data, tightening the cleaning and tokenisation, and re-running the pipeline until the corpus looks meaningful for downstream ML.
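As an example of that feedback loop, the fraction artefacts became one extra cleaning step in the silver layer. This sketch assumes the text lives in a description column and covers both literal "u00bd"-style escape strings and the actual fraction characters:

```python
from pyspark.sql import functions as F

def strip_fraction_artifacts(df, col="description"):
    """Remove leaked unicode-fraction artefacts from a text column."""
    return (
        df
        # Literal escape strings such as "u00bd" or "\u00bc" left in the text.
        .withColumn(col, F.regexp_replace(col, r"\\?u00b[c-e]", " "))
        # The actual characters: 1/4, 1/2, 3/4 and the other vulgar fractions.
        .withColumn(col, F.regexp_replace(col, "[\u00bc-\u00be\u2150-\u215e]", " "))
    )
```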
Topic modelling with PySpark ML (LDA)
Once the recipes were tokenised, I used PySpark ML to apply a classic NLP technique called Latent Dirichlet Allocation (LDA).
At a high level, LDA assumes that:
- Each recipe is a mixture of a few underlying topics (for example "pasta", "curries", "baking"), and
- Each topic is defined by a particular set of words that tend to appear together (e.g. a "pasta" topic might be dominated by words like pasta, tomato, garlic, olive oil).
The approach I took (sketched in code after these steps) was:
Turn tokens into numeric features
- From the token lists, I built a simple count-based representation: for each recipe, how often each word appears. This gives the model a structured view of the language used in the dataset, rather than raw text.
Fit an LDA model over the whole corpus
- LDA looks across all recipes and learns a fixed number of topics. It doesn't know anything about "Italian" or "dessert" explicitly; it discovers topics purely from patterns in word co-occurrence.
Assign each recipe a topic profile
- The model then gives every recipe a topic distribution: for example, a recipe might be 70% "pasta/Italian", 20% "quick midweek meals", 10% "vegetarian". This topic profile becomes a compact, semantic fingerprint for that recipe.
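Put together, the modelling job looks roughly like this; it's a sketch, and the vocabulary size, topic count and iteration count are illustrative rather than the exact values I used:

```python
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

# 1. Tokens -> per-recipe word counts.
cv_model = CountVectorizer(inputCol="tokens", outputCol="features",
                           vocabSize=5000, minDF=5).fit(tokens_df)
vectorized = cv_model.transform(tokens_df)

# 2. Fit LDA over the whole corpus with a fixed number of topics.
lda_model = LDA(k=10, maxIter=50, featuresCol="features").fit(vectorized)

# Inspect the top words defining each discovered topic.
lda_model.describeTopics(maxTermsPerTopic=8).show(truncate=False)

# 3. Attach a topic distribution (the "topic profile") to every recipe.
profiled = lda_model.transform(vectorized)  # adds a topicDistribution column
```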
These topic profiles are what I use for the recommendation engine: recipes with similar topic distributions are treated as similar, so I can recommend recipes that share underlying themes and flavour profiles, not just recipes with exactly the same ingredients.
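A minimal sketch of that similarity step, ranking recipes by cosine similarity of their topicDistribution vectors (the recipe title is just an example, and `profiled` comes from the LDA sketch above):

```python
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Topic profile of the recipe we want recommendations for.
target = (profiled.filter(F.col("title") == "Spicy Chickpea Curry")
                  .select("topicDistribution").first()[0].toArray())

@F.udf(DoubleType())
def cosine_to_target(vec):
    v = np.array(vec.toArray())
    denom = float(np.linalg.norm(v) * np.linalg.norm(target))
    return float(v @ target / denom) if denom else 0.0

# Most similar recipes by topic profile (the top hit is the recipe itself).
recommendations = (profiled
    .withColumn("similarity", cosine_to_target("topicDistribution"))
    .orderBy(F.col("similarity").desc())
    .select("title", "similarity"))
```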
Bringing it together with Genie
To make this useful beyond notebooks, I added AI/BI Genie on top of the curated tables:
Genie understands the recipe attributes and topic features.
You can ask questions like:
- "Recommend three vegetarian recipes similar to Spicy Chickpea Curry."
- "Show quick pasta dishes with a similar flavour profile to Garlic Shrimp Pasta."
Genie converts these prompts into SQL over the Delta tables and uses the topic information to return tailored recommendations. From a user's perspective, they just describe what they fancy cooking; under the hood, traditional ML is doing the heavy lifting.
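For illustration, the first prompt above might translate into SQL along these lines. This is hand-written to show the idea, not captured from Genie, and the recipe_similarities table and its columns are hypothetical:

```python
# Hand-written approximation of the kind of SQL Genie generates under the hood.
spark.sql("""
    SELECT s.recommended_title, s.similarity
    FROM recipe_similarities s
    JOIN recipes r ON r.title = s.recommended_title
    WHERE s.source_title = 'Spicy Chickpea Curry'
      AND r.is_vegetarian = true
    ORDER BY s.similarity DESC
    LIMIT 3
""").show()
```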
Overall, this project shows how traditional ML (LDA, feature engineering) and modern AI interfaces (Genie) can work together to deliver an end-to-end recipe recommendation engine on Databricks Free Edition, and how an iterative approach to data preparation is often key to getting good results.
Moving forward, I would like to revisit classifying Arabic texts and use Databricks to help analyse classical texts. NLP concepts work completely differently in Arabic compared to English, so I want to see if I can build something truly end to end.
You can watch the demo here: https://www.youtube.com/watch?v=JX0qyBD7qyM
If you're doing anything similar with NLP, topic models or Genie, I'd love to compare approaches!