Hi everyone,
For the Databricks Free Edition Hackathon, I wanted to show that traditional ML still has a big role today, and how it can work hand-in-hand with Databricks' newer AI tooling. As a concrete use case, I created a recipe recommendation engine that suggests relevant recipes to users, using classic NLP and topic modelling to structure the data, then AI/BI Genie to bring that value out for end users. The two work together rather than instead of each other. I've always had an interest in NLP tools for analysing classical Arabic texts, but I'd never built an end-to-end solution in Databricks that could bring an NLP solution to life, so I thought this was a great chance to demonstrate it.
What I built
I built a recipe recommendation engine with the following components:
- Lakeflow Declarative Pipelines (LDP) to ingest and prepare text data using a medallion architecture.
- PySpark ML with an LDA topic model to discover themes in recipes, built as a job in Databricks.
- AI/BI Genie on top to explore and get recommendations in natural language
All of this runs on Databricks Free Edition.
Data & preparation (LDP + tokenising)
The starting point was a Kaggle recipes dataset with titles, descriptions and ingredients.
Using LDP, I set up a simple pipeline (sketched in code after this list) to:
- Bronze - Ingest the raw data
- Silver - Clean obvious issues (duplicates, missing key fields)
- Silver - Focus on the crucial NLP step: tokenising the text
- Gold - Aggregates, including downstream tables with each recipe's title and its associated words (tokens)
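To make the layers concrete, here is a minimal sketch of the bronze and silver steps using the dlt Python API; the source path, file format and column names are assumptions, not the exact ones from my pipeline:

```python
import dlt

# Hypothetical source location -- adjust to the Kaggle dataset layout.
RAW_PATH = "/Volumes/recipes/raw/"

@dlt.table(comment="Bronze: raw recipe CSVs ingested as-is")
def recipes_bronze():
    # `spark` is provided by the pipeline runtime.
    return spark.read.option("header", "true").csv(RAW_PATH)

@dlt.table(comment="Silver: duplicates removed, rows missing key fields dropped")
def recipes_silver():
    return (
        dlt.read("recipes_bronze")
        .dropDuplicates(["title"])
        .dropna(subset=["title", "ingredients", "description"])
    )
```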
By tokenising, I mean (a PySpark sketch follows this list):
- Breaking text into meaningful words
- Example: "Spicy tomato pasta with fresh basil"
["spicy", "tomato", "pasta", "fresh", "basil"]
- Removing noise (stop words)
- Stripping out filler words like "and", "with" and "the". These are called stop words, and they don't really bring value to classification.
- Normalising similar word forms
- Treating "cook", "cooks", "cooked", "cooking" more like the same underlying word, so the model focuses on the concept rather than inflected forms.
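In PySpark ML terms, those three steps look roughly like this. It's a sketch: the recipes_silver table, the description column and the extra stop words are illustrative assumptions:

```python
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

recipes_silver = spark.table("recipes_silver")

# Break text into lowercase word tokens, splitting on non-word characters.
tokenizer = RegexTokenizer(inputCol="description", outputCol="raw_tokens",
                           pattern="\\W+", toLowercase=True)

# Default English stop words, extended with corpus-specific noise words.
extra_stop_words = ["recipe", "cup", "tablespoon"]  # illustrative only
remover = StopWordsRemover(
    inputCol="raw_tokens", outputCol="tokens",
    stopWords=StopWordsRemover.loadDefaultStopWords("english") + extra_stop_words,
)

tokens_df = remover.transform(tokenizer.transform(recipes_silver))

# Normalising word forms (cook/cooks/cooking) would typically be a stemming or
# lemmatising UDF (e.g. wrapping an NLTK stemmer) over the tokens column.
```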
After building the initial pipeline, I continuously monitored and iterated on the data rather than assuming it was "done":
- I used Genie to ask questions like "Which very common words bring little or no value to the corpus?" to surface extra stop words that weren't helpful for modelling.
- I also generated a word cloud image over the tokens and discovered that some values were actually encoded fractions like u00bd (½) leaking into the text. That fed back into the cleaning logic in my LDP (see the sketch below) so these artefacts were stripped out.
This showed that working with text data is rarely straightforward: you often need an iterative loop of inspecting the data, tightening the cleaning and tokenisation, and re-running the pipeline until the corpus looks meaningful for downstream ML.
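As an example of that feedback loop, the fraction artefacts became one extra cleaning step in the silver layer. This sketch assumes the text lives in a description column and covers both literal "u00bd"-style escape strings and the actual fraction characters:

```python
from pyspark.sql import functions as F

def strip_fraction_artifacts(df, col="description"):
    """Remove leaked unicode-fraction artefacts from a text column."""
    return (
        df
        # Literal escape strings such as "u00bd" or "\u00bc" left in the text.
        .withColumn(col, F.regexp_replace(col, r"\\?u00b[c-e]", " "))
        # The actual characters: 1/4, 1/2, 3/4 and the other vulgar fractions.
        .withColumn(col, F.regexp_replace(col, "[\u00bc-\u00be\u2150-\u215e]", " "))
    )
```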
Topic modelling with PySpark ML (LDA)
Once the recipes were tokenised, I used PySpark ML to apply a classic NLP technique called Latent Dirichlet Allocation (LDA).
At a high level, LDA assumes that:
- Each recipe is a mixture of a few underlying topics (for example "pasta", "curries", "baking"), and
- Each topic is defined by a particular set of words that tend to appear together (e.g. a "pasta" topic might be dominated by words like pasta, tomato, garlic, olive oil).
The approach I took (sketched in code after these steps) was:
Turn tokens into numeric features
- From the token lists, I built a simple count-based representation: for each recipe, how often each word appears. This gives the model a structured view of the language used in the dataset, rather than raw text.
Fit an LDA model over the whole corpus
- LDA looks across all recipes and learns a fixed number of topics. It doesn't know anything about "Italian" or "dessert" explicitly; it discovers topics purely from patterns in word co-occurrence.
Assign each recipe a topic profile
- The model then gives every recipe a topic distribution: for example, a recipe might be 70% "pasta/Italian", 20% "quick midweek meals", 10% "vegetarian". This topic profile becomes a compact, semantic fingerprint for that recipe.
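Put together, the modelling job looks roughly like this; it's a sketch, and the vocabulary size, topic count and iteration count are illustrative rather than the exact values I used:

```python
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

# 1. Tokens -> per-recipe word counts.
cv_model = CountVectorizer(inputCol="tokens", outputCol="features",
                           vocabSize=5000, minDF=5).fit(tokens_df)
vectorized = cv_model.transform(tokens_df)

# 2. Fit LDA over the whole corpus with a fixed number of topics.
lda_model = LDA(k=10, maxIter=50, featuresCol="features").fit(vectorized)

# Inspect the top words defining each discovered topic.
lda_model.describeTopics(maxTermsPerTopic=8).show(truncate=False)

# 3. Attach a topic distribution (the "topic profile") to every recipe.
profiled = lda_model.transform(vectorized)  # adds a topicDistribution column
```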
These topic profiles are what I use for the recommendation engine: recipes with similar topic distributions are treated as similar, so I can recommend recipes that share underlying themes and flavour profiles, not just recipes with exactly the same ingredients.
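A minimal sketch of that similarity step, ranking recipes by cosine similarity of their topicDistribution vectors (the recipe title is just an example, and `profiled` comes from the LDA sketch above):

```python
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Topic profile of the recipe we want recommendations for.
target = (profiled.filter(F.col("title") == "Spicy Chickpea Curry")
                  .select("topicDistribution").first()[0].toArray())

@F.udf(DoubleType())
def cosine_to_target(vec):
    v = np.array(vec.toArray())
    denom = float(np.linalg.norm(v) * np.linalg.norm(target))
    return float(v @ target / denom) if denom else 0.0

# Most similar recipes by topic profile (the top hit is the recipe itself).
recommendations = (profiled
    .withColumn("similarity", cosine_to_target("topicDistribution"))
    .orderBy(F.col("similarity").desc())
    .select("title", "similarity"))
```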
Bringing it together with Genie
To make this useful beyond notebooks, I added AI/BI Genie on top of the curated tables:
Genie understands the recipe attributes and topic features.
You can ask questions like:
- "Recommend three vegetarian recipes similar to Spicy Chickpea Curry."
- "Show quick pasta dishes with a similar flavour profile to Garlic Shrimp Pasta."
Genie converts these prompts into SQL over the Delta tables and uses the topic information to return tailored recommendations. From a user's perspective, they just describe what they fancy cooking; under the hood, traditional ML is doing the heavy lifting.
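For illustration, the first prompt above might translate into SQL along these lines. This is hand-written to show the idea, not captured from Genie, and the recipe_similarities table and its columns are hypothetical:

```python
# Hand-written approximation of the kind of SQL Genie generates under the hood.
spark.sql("""
    SELECT s.recommended_title, s.similarity
    FROM recipe_similarities s
    JOIN recipes r ON r.title = s.recommended_title
    WHERE s.source_title = 'Spicy Chickpea Curry'
      AND r.is_vegetarian = true
    ORDER BY s.similarity DESC
    LIMIT 3
""").show()
```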
Overall, this project shows how traditional ML (LDA, feature engineering) and modern AI interfaces (Genie) can work together to deliver an end-to-end recipe recommendation engine on Databricks Free Edition, and how an iterative approach to data preparation is often key to getting good results.
Moving forward, I would like to revisit classifying Arabic texts and use Databricks to help analyse classical texts. NLP concepts work completely differently in Arabic compared to English, so I want to see if I can build something truly end to end.
You can watch the demo here: https://www.youtube.com/watch?v=JX0qyBD7qyM
If you're doing anything similar with NLP, topic models or Genie, I'd love to compare approaches!