PanosAthanasiou
Databricks Employee

Introduction

After Databricks' Data + AI Summit, an electrifying hackathon lit up the London HQ. Teams from across industries came together to build bold, tangible solutions using DBX’s latest features. And… we won! How? Vision, teamwork, and a killer idea.

The hackathon, titled “Back from Summit: Review of the Major Announcements + GenAI Hackathon,” was held at Databricks’ London HQ on Windmill Street. We were especially impressed by the new features that were announced, like Agent Bricks and AI/BI Genie. Our team brought a powerhouse mix of skills to the table: a seasoned software veteran, Enoch; a deep learning and LLM expert, Jorge; and a math wizard, Sergio.

 

From Manual Guesswork to Data-Driven Decisions

We aimed to resolve a vital distribution problem for a retail distributor. Our hypothetical company, FastBuy, has successfully expanded to 10 shops around Europe selling a variety of products. However, like many companies in this transformative age, FastBuy relied on a manual process for stock distribution. Decisions were based on analysing old Excel sheets and making back-of-the-envelope estimates for each product.


Figure 1: Map showing the location of our 10 shops around Europe.

We identified a key enhancement that could significantly boost FastBuy’s profit: historical data-driven forecasting based on past product and store sales. Our predictive model outperformed their previous baseline, increasing revenue by a whopping 280%. More details on the mathematical models and the assumed baseline are given in the next section, Mathematical Modelling.


Figure 2: Comparing various forecasting models against the baseline shows a revenue increase from ~$280M to ~$800M — a clear testament to the power of data-driven forecasting.

The second part of the problem involved thinking about how to expose the predictive model so that it would be useful in a company that was just starting to modernize its tech. One cannot simply automate everything: trustworthy, reliable transformation is iterative and relies on user feedback. We opted to build a tool that could inform human analysts when making decisions. The tool provided both a playground for scenario experimentation, where users could try out different supply strategies and observe the predicted yield, and a chat interface where less technical users could interact with the data through a foundation model. The details of the application are presented in the section Our Solution.

The dataset provided for the hackathon contained the historical demand of several items across ten stores, with columns such as Date, Sales, Item ID, and Store ID. Building on this foundation, we enhanced it with two additional dimension tables: Store (including Store ID and Location) and Item (including Item ID, Price, and Cost). This expanded schema (illustrated in Figure 3) allowed us to simulate realistic supply strategies and measure their economic impact.


 

Figure 3: FastBuy dataset schema used in the hackathon. The original sales data was expanded with Store and Item tables to enable realistic, economically grounded simulations.
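
To make the schema concrete, here is a minimal sketch in pandas of how the sales fact table can be joined to the two dimension tables to attach prices, costs and locations to each record; the table contents are illustrative and the column names simply mirror Figure 3.

```python
import pandas as pd

# Illustrative rows mirroring the Figure 3 schema.
sales = pd.DataFrame({
    "Date": ["2025-06-01", "2025-06-01"],
    "Store ID": [1, 2],
    "Item ID": [10, 10],
    "Sales": [120, 95],
})
stores = pd.DataFrame({"Store ID": [1, 2], "Location": ["London", "Paris"]})
items = pd.DataFrame({"Item ID": [10], "Price": [4.5], "Cost": [2.0]})

# Join the fact table to both dimension tables.
enriched = sales.merge(stores, on="Store ID").merge(items, on="Item ID")

# With prices and costs attached, per-row revenue and margin follow directly.
enriched["Revenue"] = enriched["Sales"] * enriched["Price"]
enriched["Margin"] = enriched["Sales"] * (enriched["Price"] - enriched["Cost"])
print(enriched)
```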

In future iterations, the schema could be extended further to incorporate elements such as product source locations or inter-store distances, enabling more complex and spatially aware optimization scenarios.

 

Mathematical Modelling

Now to the exciting part. What was the mathematical modelling that enabled the 280% increase in profits? First of all, let’s frame the problem as an optimisation problem. We are selling n items x_1, …, x_n, which have two values associated with them: at time t, their costs are c_1, …, c_n and their market prices are p_1, …, p_n. The cost encapsulates the cost our company incurs in manufacturing, transportation and selling. Despite the simplification of the scenario, this is indeed the classic framework used in basic microeconomic models (although we have added product heterogeneity). We work with the assumption that our company is a price taker (again, a classic microeconomic assumption), which just means that prices are fixed and given by market equilibrium. The same goes for the costs. Also, at every given time, there is some demand for each item given by d_1(t), …, d_n(t). The aim is to choose a supply s_1(t), …, s_n(t) which maximises our expected profit. Mathematically, this is given by

$$\max_{s_1(t),\dots,s_n(t)} \;\; \sum_{i=1}^{n} \Big( p_i \,\mathbb{E}\big[\min\big(s_i(t),\, d_i(t)\big)\big] - c_i\, s_i(t) \Big)$$

Now, the problem is that we need to choose our product supply before we know what the actual demand is! Otherwise, we wouldn’t have time to ship our products to the shops. 
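To make the objective concrete, here is a minimal sketch of how the realised profit of a chosen supply can be evaluated once the demand is known. It assumes revenue is earned only on units actually sold while every supplied unit incurs its cost; the numbers are purely illustrative.

```python
import numpy as np

def realised_profit(supply, demand, price, cost):
    """Revenue is earned only on units actually sold; cost is paid on every supplied unit."""
    sold = np.minimum(supply, demand)
    return float(np.sum(price * sold - cost * supply))

# Illustrative numbers: two items, one day.
price = np.array([10.0, 6.0])
cost = np.array([4.0, 3.0])
demand = np.array([100, 40])   # realised demand (unknown when the supply is chosen)
supply = np.array([120, 30])   # the chosen supply

print(realised_profit(supply, demand, price, cost))
# item 1: 10*100 - 4*120 = 520; item 2: 6*30 - 3*30 = 90  ->  610.0
```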

Before the data scientists joined our team we had heuristic, intuition-based strategies. However, bringing time series forecasting into play has revolutionised the way we choose our supply strategy. The idea is very simple: at time t-1 we make a forecast for tomorrow’s demand and use that as our supply, i.e. s_i(t) = d̂_i(t) = E[d_i(t) | d_i(s), s < t]. The last equality is just telling us that our forecast at time t is the expected value of the demand given all past data. But what models did we use to make these forecasts?

1. Autoregressive Models with Lag p

The formula is given by 

$$d_i(t) = \alpha_0 + \sum_{j=1}^{p} \alpha_j \, d_i(t-j) + \varepsilon_t$$

where the alpha parameters are chosen analytically by an ordinary least squares method and ε_t is the difference between our estimate and the true value. This model is the canonical benchmark model in many time series forecasting papers.
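
As a minimal sketch of fitting such a lag-p model, here is how a single item-store demand series could be handled with statsmodels; the series and the lag order are hypothetical stand-ins for the real data.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Hypothetical daily demand history for one item at one store (weekly seasonality + noise).
rng = np.random.default_rng(0)
demand = 50 + 10 * np.sin(np.arange(200) * 2 * np.pi / 7) + rng.normal(0, 3, 200)

# Fit an AR(p) model; the alpha parameters are estimated by ordinary least squares.
p = 7
fitted = AutoReg(demand, lags=p).fit()

# One-step-ahead forecast: tomorrow's expected demand, which we use as tomorrow's supply.
forecast = fitted.predict(start=len(demand), end=len(demand))
print(forecast)
```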

2. Auto-ARIMA Models

These are an extension of the above. ARIMA(p, q) models are given by the equation

$$d_i(t) = \alpha_0 + \sum_{j=1}^{p} \alpha_j \, d_i(t-j) + \sum_{k=1}^{q} \beta_k \, \varepsilon_{t-k} + \varepsilon_t$$

In order to choose the hyperparameters p and q, we used the auto-ARIMA variant, which uses the Bayesian Information Criterion. Essentially, this is a log-likelihood hyperparameter selection criterion which penalises overparametrisation to avoid overfitting.
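
A minimal sketch of this selection step with the pmdarima package, asking it to score candidate orders with BIC; the demand series is again a hypothetical stand-in.

```python
import numpy as np
import pmdarima as pm

# Hypothetical demand history for one item at one store.
rng = np.random.default_rng(1)
demand = 50 + np.cumsum(rng.normal(0, 1, 300)) + rng.normal(0, 3, 300)

# auto_arima searches over candidate orders and scores them with the
# Bayesian Information Criterion, penalising overparametrised models.
model = pm.auto_arima(
    demand,
    information_criterion="bic",
    seasonal=False,
    suppress_warnings=True,
)
print(model.order)                 # the selected order
print(model.predict(n_periods=1))  # tomorrow's demand forecast
```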

Right now, our data scientists are working on expanding our modelling to include multivariate models. This would allow us to study spatial correlations between demands across different shops. This is useful because, for example, understanding the demand of a certain product in Paris could help us understand the demand for that same product in London.

 

Our Solution

Our application leverages a Streamlit frontend deployed within Databricks Apps to deliver a responsive and interactive user experience. This setup allows seamless integration with Databricks-hosted data and model endpoints, enabling real-time analytics and visualization. Development collaboration is streamlined using the Databricks CLI’s “sync” command, which keeps the local codebase, edited in VS Code, in sync with the remote workspace. This workflow facilitates rapid iteration and deployment, especially in team environments where reproducibility and version control are critical.

The core functionality of the app centers around dynamic data retrieval and visualization. It connects to a Databricks SQL warehouse using secure credentials and fetches structured sales data from a designated table. Users can filter the dataset by store, date range, and forecasting strategy, which dynamically updates the visualizations. These include time-series plots for demand, supply, and cumulative profit, as well as bar charts for top-selling items and heatmaps for store-item sales distribution. Strategy-specific columns are selected based on user input, allowing comparative analysis across the forecasting models presented in the previous section, such as the autoregressive and auto-ARIMA models.
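
The sketch below illustrates this retrieval-and-filter pattern, assuming the databricks-sql-connector package; the environment variables, table and column names are hypothetical placeholders rather than the app’s actual ones.

```python
import os
import pandas as pd
import streamlit as st
from databricks import sql

@st.cache_data
def load_sales() -> pd.DataFrame:
    # Credentials and table name are hypothetical environment variables.
    with sql.connect(
        server_hostname=os.environ["DATABRICKS_HOST"],
        http_path=os.environ["DATABRICKS_HTTP_PATH"],
        access_token=os.environ["DATABRICKS_TOKEN"],
    ) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM fastbuy.sales_with_forecasts")
            return cur.fetchall_arrow().to_pandas()

df = load_sales()

# Sidebar filters drive every visualization.
store = st.sidebar.selectbox("Store", sorted(df["store_id"].unique()))
strategy = st.sidebar.selectbox("Strategy", ["autoregressive", "auto_arima"])

filtered = df[df["store_id"] == store]
st.line_chart(filtered.set_index("date")[["demand", f"supply_{strategy}"]])
```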

To support contextual insights and conversational AI features, the app aggregates key business metrics into a structured summary. This includes total sales, top and bottom performing items and stores, and the selected date range. These metrics are stored in session state for downstream use in chat-based interactions or decision support modules. The application also includes robust error handling for missing environment variables and fallback logic for mock data generation, ensuring continuity in development and testing environments. Overall, the architecture balances modularity, scalability, and user-centric design, making it suitable for production-grade analytics workflows.
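
As a sketch of that summary step (the metric names, session key and sample data below are illustrative, not the app’s exact ones):

```python
import pandas as pd
import streamlit as st

# Illustrative filtered sales data; in the app this comes from the SQL warehouse query.
filtered = pd.DataFrame({
    "date": pd.to_datetime(["2025-06-01", "2025-06-02", "2025-06-02"]),
    "item_id": [10, 10, 11],
    "demand": [120, 95, 40],
})

per_item = filtered.groupby("item_id")["demand"].sum()

# Aggregate key business metrics and store them in session state so the
# chat and decision-support components can ground their answers.
st.session_state["business_summary"] = {
    "total_sales": int(filtered["demand"].sum()),
    "top_item": int(per_item.idxmax()),
    "bottom_item": int(per_item.idxmin()),
    "date_range": (str(filtered["date"].min().date()), str(filtered["date"].max().date())),
}
```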

 

Taking It Further

Easy wins offered by Databricks features:

  • Models as endpoints: To build a scalable application, we can separate the application’s user interface from the underlying statistical models, and serve the models through endpoints. This can be easily done in Databricks by registering a custom PyFunc model with MLflow and then deploying it to a managed endpoint (see the sketch after this list). Managed endpoints have the added advantage of handling autoscaling, which makes the app faster and cheaper to run.
  • Genie: The LLM component can be enhanced by querying Genie instead of an LLM, and setting up Genie to access FastBuy’s dataset.
  • MLflow experiment tracking: Experimentation by FastBuy’s data scientists is logged in MLflow, with the corresponding metadata and hyperparameters for each run. This is a best practice that is often overlooked.
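
As referenced in the first bullet, here is a minimal sketch of wrapping a fitted forecaster as a custom PyFunc model and logging it with MLflow; the model class, demand series and registered model name are hypothetical. Once registered, the model can be attached to a Databricks model serving endpoint.

```python
import mlflow
import mlflow.pyfunc
import pandas as pd
import pmdarima as pm


class DemandForecaster(mlflow.pyfunc.PythonModel):
    """Wraps a fitted auto-ARIMA model so it can be served from an endpoint."""

    def __init__(self, model):
        self.model = model

    def predict(self, context, model_input: pd.DataFrame) -> pd.DataFrame:
        horizon = int(model_input["horizon"].iloc[0])
        return pd.DataFrame({"forecast": self.model.predict(n_periods=horizon)})


# Fit on a (hypothetical) demand series and log the wrapped model.
demand = pd.Series([50.0, 52, 49, 55, 53, 58, 60, 57, 61, 59])
fitted = pm.auto_arima(demand, information_criterion="bic", seasonal=False)

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="demand_forecaster",
        python_model=DemandForecaster(fitted),
        registered_model_name="fastbuy_demand_forecaster",  # hypothetical name
    )
```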

As for the application itself, we built it with a section where a user can test different supply strategies and figure out the best one. It could be connected to distribution software so that, once validated by a user, a particular strategy could be automatically scheduled.

 

Unlock the power of GenAI 

Leverage our technical GenAI expertise to support and accelerate your organization's GenAI strategy. Whether you're just getting started or looking to scale, we can help you identify, prioritize, and implement high-impact GenAI use cases tailored to your business goals. 

Our team can support you in your GenAI journey, from ideation to production, to ensure you get the most out of your GenAI investments.

Interested in learning more? Reach out to one of our experts today!

 

Who are we?

Enoch Kan and Jorge Gallego Feliciano work at Aimpoint Digital, where they help organizations harness data and AI to make robust, evidence-based business decisions.

Aimpoint Digital is a market-leading analytics firm at the forefront of solving the most complex business and economic challenges through data and analytical technology. From integrating self-service analytics to implementing AI at scale and modernizing data infrastructure environments, Aimpoint Digital operates across transformative domains to improve the performance of organizations. Connect with our team and get started today.

Sergio Estan Ruiz is a PhD student at Imperial College London where he is pursuing a degree in Machine Learning for Public Policy. He is interested in time series, LLMs, geometric and topological deep learning and more!