<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hackathon Project: Recipe Recommendation Engine with Traditional ML + Genie on Databricks Free Edit in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/hackathon-project-recipe-recommendation-engine-with-traditional/m-p/139048#M896</link>
    <description>&lt;P&gt;&lt;SPAN&gt;Hi everyone,&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;For the Databricks Free Edition Hackathon, I wanted to show that &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;traditional ML still has a big role today&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;, and how it can work hand-in-hand with &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;Databricks’ newer AI tooling&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;. As a concrete use case, I created a &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;recipe recommendation engine&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; that suggests relevant recipes to users, using classic NLP and topic modelling to structure the data, then &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;AI/BI Genie&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; to bring that value out for end users. The two work together rather than replacing each other. I’ve always been interested in using NLP tools to analyse classical Arabic texts, but I’ve never built an end-to-end solution in Databricks that brings an NLP solution to life, so I thought this was a great chance to demonstrate one. 
&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;SPAN&gt;What I built&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I built a &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;recipe recommendation engine&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; with the following components:&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN&gt;Lakeflow Declarative Pipelines (LDP)&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; to ingest and prepare text data using a medallion architecture.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN&gt;PySpark ML&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; with an &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;LDA topic model&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; to discover themes in recipes, built as a job in Databricks.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN&gt;AI/BI Genie&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; on top to explore the data and get recommendations in natural language.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;All of this runs on &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;Databricks Free Edition&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Data &amp;amp; preparation (LDP + tokenising)&lt;/STRONG&gt;&amp;nbsp;&lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The starting point was a &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;Kaggle recipes dataset&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; with titles, descriptions and ingredients.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Using &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;LDP&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;, I set up a simple pipeline 
to:&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Bronze - Ingest the raw data &lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Silver - Clean obvious issues (duplicates, missing key fields)&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Silver - Focus on the crucial NLP step: &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;tokenising&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; the text&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Gold - Aggregated downstream tables containing each recipe’s title and its associated words (tokens) &lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;By tokenising, I mean:&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN&gt;Breaking text into meaningful words &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Example: "Spicy tomato pasta with fresh basil" becomes&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;["spicy", "tomato", "pasta", "fresh", "basil"]&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN&gt;Removing noise (stop words)&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Stripping out filler words like “and”, “with” and “the”, known as stop words. Stop words such as “and” add little value to classification systems. 
&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN&gt;Normalising similar word forms &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Treating "cook", "cooks", "cooked" and "cooking" as the same underlying word, so the model focuses on the concept rather than inflected forms. &lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;After building the initial pipeline, I &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;continuously monitored and iterated on the data&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; rather than assuming it was “done”:&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;I used &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;Genie&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; to ask questions like &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;“Which very common words bring little or no value to the corpus?”&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt; to surface extra stop words that weren’t helpful for modelling.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;I also generated a &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;word cloud&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; image over the tokens and discovered that some tokens were actually encoded fractions such as u00bd (½) leaking into the text. 
That fed back into the cleaning logic in my LDP so these artefacts were stripped out.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;This showed that working with text data is rarely straightforward: you often need an &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;iterative loop&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; of inspecting the data, tightening the cleaning and tokenisation, and re-running the pipeline until the corpus looks meaningful for downstream ML.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Topic modelling with PySpark ML (LDA)&lt;/STRONG&gt;&amp;nbsp;&lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Once the recipes were tokenised, I used &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;PySpark ML&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; to apply a classic NLP technique called &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;Latent Dirichlet Allocation (LDA)&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;At a high level, LDA assumes that:&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Each &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;recipe&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; is a mixture of a few underlying &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;topics&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; (for example “pasta”, “curries”, “baking”), and&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Each &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;topic&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; is defined by a particular set of words that tend to appear together (e.g. 
a “pasta” topic might be dominated by words like &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;pasta, tomato, garlic, olive oil&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt;).&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;The approach I took was:&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;SPAN&gt;Turn tokens into numeric features&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;From the token lists, I built a simple count-based representation: for each recipe, how often each word appears. This gives the model a structured view of the language used in the dataset, rather than raw text.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;&lt;SPAN&gt;Fit an LDA model over the whole corpus&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;LDA looks across all recipes and learns a fixed number of topics. It doesn’t know anything about “Italian” or “dessert” explicitly – it discovers topics purely from patterns in word co-occurrence.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;&lt;SPAN&gt;Assign each recipe a topic profile&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;The model then gives every recipe a &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;topic distribution&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; – for example, a recipe might be 70% “pasta/Italian”, 20% “quick midweek meals”, 10% “vegetarian”. 
This topic profile becomes a compact, semantic fingerprint for that recipe.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;These topic profiles are what I use for the &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;recommendation engine&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;: recipes with similar topic distributions are treated as similar, so I can recommend recipes that share underlying themes and flavour profiles, not just recipes with exactly the same ingredients.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Bringing it together with Genie&amp;nbsp;&lt;/STRONG&gt;&lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;To make this useful beyond notebooks, I added &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;AI/BI Genie&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; on top of the curated tables:&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Genie understands the recipe attributes and topic features.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;You can ask questions like:&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;“Recommend three vegetarian recipes similar to &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;Spicy Chickpea Curry&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt;.”&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;“Show quick pasta dishes with a similar flavour profile to &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;Garlic Shrimp Pasta&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt;.”&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;Genie converts these prompts into SQL over the Delta tables and uses the topic information to return tailored recommendations. From a user’s perspective, they just describe what they fancy cooking; under the hood, traditional ML is doing the heavy lifting. 
&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Overall, this project shows how &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;traditional ML (LDA, feature engineering)&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; and &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;modern AI interfaces (Genie)&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; can work together to deliver an end-to-end &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;recipe recommendation engine&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; on &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;Databricks Free Edition&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt;, and how an iterative approach to data preparation is often key to getting good results. &amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Moving forward, I would like to revisit classifying Arabic texts and use Databricks to help analyse the classical texts. NLP concepts work completely differently in Arabic compared with English, so I want to see if I can build something truly end to end. &lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;You can watch the demo here:&amp;nbsp;&lt;A href="https://www.youtube.com/watch?v=JX0qyBD7qyM" target="_blank"&gt;https://www.youtube.com/watch?v=JX0qyBD7qyM&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;If you’re doing anything similar with NLP, topic models or Genie, I’d love to compare approaches &lt;span class="lia-unicode-emoji" title=":backhand_index_pointing_down:"&gt;👇&lt;/span&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 14 Nov 2025 10:27:46 GMT</pubDate>
    <dc:creator>hasnat_unifeye</dc:creator>
    <dc:date>2025-11-14T10:27:46Z</dc:date>
    <item>
      <title>Hackathon Project: Recipe Recommendation Engine with Traditional ML + Genie on Databricks Free Edit</title>
      <link>https://community.databricks.com/t5/community-articles/hackathon-project-recipe-recommendation-engine-with-traditional/m-p/139048#M896</link>
      <pubDate>Fri, 14 Nov 2025 10:27:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/hackathon-project-recipe-recommendation-engine-with-traditional/m-p/139048#M896</guid>
      <dc:creator>hasnat_unifeye</dc:creator>
      <dc:date>2025-11-14T10:27:46Z</dc:date>
    </item>
    <item>
      <title>Re: Hackathon Project: Recipe Recommendation Engine with Traditional ML + Genie on Databricks Free E</title>
      <link>https://community.databricks.com/t5/community-articles/hackathon-project-recipe-recommendation-engine-with-traditional/m-p/139107#M897</link>
      <description>&lt;P&gt;This is amazing&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/197392"&gt;@hasnat_unifeye&lt;/a&gt;. Well done and good luck for the hackathon.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Nov 2025 15:07:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/hackathon-project-recipe-recommendation-engine-with-traditional/m-p/139107#M897</guid>
      <dc:creator>Raman_Unifeye</dc:creator>
      <dc:date>2025-11-14T15:07:37Z</dc:date>
    </item>
  </channel>
</rss>

