Databricks Community

ashish51 · ‎05-31-2026

Hey everyone!
For the DAIS 2026 Community Virtual Challenge, I built a LEGO Value Engine using Databricks Free Edition.
This is a passion project that combined my interests of both LEGOs and Data Engineering.
When a new LEGO set releases, it can be hard to determine if the set is actually worth buying.
Some sets cost hundreds of dollars and are heavily marketed, while others receive much less attention. As a collector, it can be difficult to tell whether you're paying for genuine value or simply paying for branding and popularity.
I wanted to use data to answer that question.

The Problem
When people evaluate LEGO sets, they usually look at one factor at a time:

Price
Piece Count
Average Rating
Number of Ratings given

The challenge is that none of these metrics tell the full story on their own.
A set might have a great rating but be extremely expensive. Another might have thousands of pieces but poor reviews. I wanted a way to evaluate all of these factors together and identify which sets provide the most value for the money.

The Solution

I created an end-to-end analytics pipeline using the Medallion Architecture.

Bronze Layer: Stores the raw LEGO catalog dataset exactly as received and store it as a Delta table.
Silver Layer: Cleans the data and calculates three custom scoring metrics using Spark SQL:
- Smart Value Index: (Piece Count * Average Rating) / Price. This is the primary metric, estimating exactly how much value a customer receives per dollar spent.
- Collector Potential Score: (Average Rating * Number of Reviews) / Price. This highlights sets with community consensus, rather than just a few high ratings.
- Price Efficiency: (Piece Count / Price). Amount of LEGO pieces per dollar.
Gold Layer: Generates business insights, including:
- Theme rankings based on value (Star Wars, Marvel, DC, etc.)
- The top 10 most undervalued sets
- Sets valued over $200 that justify their price

Databricks Features Used

This project uses the following:

Medallion Architecture
Delta Time Travel
Databricks Dashboards
PySpark Machine Learning
Databricks Genie AI

Delta Time Travel
Because LEGO prices change frequently due to reasons like promotions and holiday sales, I used Delta Time Travel. By querying historical versions of the Delta tables, a user can compare previous and current prices to observe exactly how a price change impacts a set's value score over time.

Machine Learning
To add another layer of analysis, I used PySpark to build a KMeans clustering model. Instead of relying on manual categories, the unsupervised model automatically grouped the catalog based on price, ratings, piece count, and overall value. The math identified four distinct market segments: Premium Collectors, High Value Sets, Casual Buyers, and Budget-Friendly Sets.

Dashboards & Genie AI
To make the results easy to view, I built a visual Dashboard. It allows users to explore which themes provide the most value, analyze the relationship between price and ratings, and view the machine learning segments across the catalog.

Finally, I connected the Gold tables to Genie AI. This allows users to ask for recommendations. A user can simply type, "Show me the top 5 undervalued sets under $100," and interact with the data to immediately receive set recommendations.

Why It Matters
While this project focuses on LEGO sets, the idea applies to many industries. Consumers and businesses frequently need to determine which products deliver the best value, which are overpriced, and how inventory naturally groups into different market segments.

This project demonstrates how Databricks can take raw data and transform it into actionable insights using data engineering, machine learning, and AI.

Thanks for reading, and I’d love to hear any feedback on what I can add or improve!

ashish51 · ‎06-04-2026

GitHub Repository: Lego Value Engine

Architecture Diagram (better quality version can be found here)

Databricks Community

DAIS Community Virtual Challenge 2026: LEGO Value Engine - using Data and AI to Find the Best LEGO

Congratulations Databricks Partners! You're Now Officially Recognized in the Databricks Community

Solution Accelerator Series | Measure Ad Effectiveness With Multi-Touch Attribution

Govern AI Spend at Scale: A Data-Driven Approach to AI Governance | Webinar

Databricks AMER Learning Festival | Virtual Training

Introducing the Genie Hub: Ask Questions, Share Builds, and Master Conversational Analytics