Hey everyone!
For the DAIS 2026 Community Virtual Challenge, I built a LEGO Value Engine using Databricks Free Edition.
This is a passion project that combined my interests of both LEGOs and Data Engineering.
When a new LEGO set releases, it can be hard to determine if the set is actually worth buying.
Some sets cost hundreds of dollars and are heavily marketed, while others receive much less attention. As a collector, it can be difficult to tell whether you're paying for genuine value or simply paying for branding and popularity.
I wanted to use data to answer that question.
The Problem
When people evaluate LEGO sets, they usually look at one factor at a time:
- Price
- Piece Count
- Average Rating
- Number of Ratings given
The challenge is that none of these metrics tell the full story on their own.
A set might have a great rating but be extremely expensive. Another might have thousands of pieces but poor reviews. I wanted a way to evaluate all of these factors together and identify which sets provide the most value for the money.
The Solution
I created an end-to-end analytics pipeline using the Medallion Architecture.
- Bronze Layer: Stores the raw LEGO catalog dataset exactly as received and store it as a Delta table.
- Silver Layer: Cleans the data and calculates three custom scoring metrics using Spark SQL:
- Smart Value Index: (Piece Count * Average Rating) / Price. This is the primary metric, estimating exactly how much value a customer receives per dollar spent.
- Collector Potential Score: (Average Rating * Number of Reviews) / Price. This highlights sets with community consensus, rather than just a few high ratings.
- Price Efficiency: (Piece Count / Price). Amount of LEGO pieces per dollar.
- Gold Layer: Generates business insights, including:
- Theme rankings based on value (Star Wars, Marvel, DC, etc.)
- The top 10 most undervalued sets
- Sets valued over $200 that justify their price
Databricks Features Used
This project uses the following:
- Medallion Architecture
- Delta Time Travel
- Databricks Dashboards
- PySpark Machine Learning
- Databricks Genie AI
Delta Time Travel
Because LEGO prices change frequently due to reasons like promotions and holiday sales, I used Delta Time Travel. By querying historical versions of the Delta tables, a user can compare previous and current prices to observe exactly how a price change impacts a set's value score over time.
Machine Learning
To add another layer of analysis, I used PySpark to build a KMeans clustering model. Instead of relying on manual categories, the unsupervised model automatically grouped the catalog based on price, ratings, piece count, and overall value. The math identified four distinct market segments: Premium Collectors, High Value Sets, Casual Buyers, and Budget-Friendly Sets.
Dashboards & Genie AI
To make the results easy to view, I built a visual Dashboard. It allows users to explore which themes provide the most value, analyze the relationship between price and ratings, and view the machine learning segments across the catalog.
Finally, I connected the Gold tables to Genie AI. This allows users to ask for recommendations. A user can simply type, "Show me the top 5 undervalued sets under $100," and interact with the data to immediately receive set recommendations.
Why It Matters
While this project focuses on LEGO sets, the idea applies to many industries. Consumers and businesses frequently need to determine which products deliver the best value, which are overpriced, and how inventory naturally groups into different market segments.
This project demonstrates how Databricks can take raw data and transform it into actionable insights using data engineering, machine learning, and AI.
Thanks for reading, and I’d love to hear any feedback on what I can add or improve!