A product demo should have a simple story and be easy to deliver. But when your platform interacts with a dozen different systems, what starts as a simple demo can end up feeling like a complex puzzle. We dedicate significant time to tools like Critical User Journeys (CUJs) to simplify our stories, but some Databricks journeys inevitably push users out of our product and into intricate setups beyond our control. This complexity motivated my recent project: developing a one-click “Model Customer” deployment that not only provisions a Databricks workspace but also sets up all the necessary cloud infrastructure (AWS, for now) and data-generating mock applications to create truly seamless demo environments.
I made a few discoveries and ran into some unexpected challenges along the way. Terraform, for instance, was a pleasant surprise—it worked pretty much out of the box and could be reliably iterated on with some LLM assistance, making it easy to automate infrastructure and handle the complexities of integrating Databricks with AWS resources. On the data side, using LLMs (GPT-4o in particular) for data generation was a major improvement over traditional libraries. I could describe the data and behaviors I needed in plain English, and the LLM spat out code that generated datasets capturing trends, seasonality, and even realistic anomalies—without the tedious, field-by-field setup.
The real challenge was how to model both the backfill of historical data and the real-time flow of events to mimic a real-world application. After some brainstorming (and once again a bit of chatting with GPT), I landed on a solution that not only made both the modeling and the infrastructure simpler but also, as a bonus, showed just how flexible the Databricks platform can be.
In most demos and training environments, it's typical to rely on pre-generated, static datasets. While these can show off specific features, they don’t really capture the dynamic nature of real-world applications. Breaking out of this box requires something that feels more like an actual production environment. In other words, it takes not just Databricks, but the actual applications and data sources that feed it. The challenge is making all of this easy to deploy—a one-liner in the terminal or a few clicks—so that the demo can run through everything from infrastructure setup on the AWS side to data processing and analytics setup in Databricks.
While writing scripts to orchestrate this style of deployment with the AWS and Databricks SDKs seemed feasible, it quickly became clear that it wouldn’t meet the broader goal of cross-platform reusability. Managing and reasoning about imperative script-based setups would also be too messy. Instead, a declarative infrastructure approach with Terraform was the right choice. It provides a clean and easily extensible way to automate the provisioning of environments, ensuring that tedious integrations between Databricks and cloud infrastructure services across AWS, Azure, GCP, etc. are consistently configured.
I was pleasantly surprised by how mature the Databricks Terraform provider is. I could define nearly all of Databricks’ resources, including some of the very latest features still in preview status. The provider covers a wide range of resources available on Databricks, including compute, code (notebooks), storage, security, and even machine learning.
I defined resources for the isolated VPC, security groups, and RDS (PostgreSQL) instance on the AWS side, and for the secret scope, catalog, notebooks, and jobs on the Databricks side, as sketched in the trimmed Terraform files below.
Using Terraform modules, combined with version control, meant that every infrastructure change could be tracked like any other code. While I’ve only focused on AWS so far, Terraform’s modular approach has laid a strong foundation for expanding to other clouds like Azure and GCP in the future.
# networking.tf, security_groups.tf, providers.tf, rds.tf
resource "aws_vpc" "isolated_vpc" { ... }
resource "aws_security_group" "rds_sg" { ... }
resource "aws_db_instance" "postgresql" { ... }
# databricks_resources.tf
resource "databricks_secret_scope" "rds_scope" { ... }
resource "databricks_catalog" "rds_catalog" { ... }
resource "databricks_notebook" "ddl" { ... }
resource "databricks_job" "realtime_data_gen" { ... }
I should still note that while Terraform is incredibly powerful, it won’t solve every infrastructure problem. In the future we’ll still need a solution for provisioning various external vendors, which may not have Terraform providers, or for which a provider may not even make sense. And there are always gaps and nuances to discover and work around; for example, while I can easily define a Databricks Workflow to run a notebook, I can’t actually trigger its execution without an API call.
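To close that particular gap, a small post-apply step outside of Terraform does the trick. Below is a rough sketch, assuming the job ID is exposed as a Terraform output and the workspace host and token are available as environment variables (the variable names are illustrative):

# Hypothetical post-apply step: trigger the Terraform-defined Workflow once via
# the Databricks Jobs API (run-now). Host, token, and job ID are assumed to be
# populated from the environment / `terraform output`.
import os
import requests

host = os.environ["DATABRICKS_HOST"]                  # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]
job_id = int(os.environ["REALTIME_DATA_GEN_JOB_ID"])  # from `terraform output`

resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": job_id},
)
resp.raise_for_status()
print("Triggered run:", resp.json()["run_id"])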
With the infrastructure automated, the next hurdle was data generation. Realistic data is crucial for meaningful demos, but generating it is notoriously tedious. I needed data that not only looked real but also behaved realistically over time, incorporating trends, seasonal patterns, and anomalies. Traditional data generation libraries require you to define fields, data types, distributions, and so on, which is simply too much work. This is where I found LLMs really shine.
LLMs excel at understanding and generating human-like text based on context. They often manage this even when my prompts are not fully thought out or completely described, which is particularly useful when sketching the outlines of a real-world dataset.
For example, the demo I was working on was a driver sign-in application for a ghost kitchen platform—a service that provides commercial kitchen spaces for both small restaurateurs and large chains to expand their pickup and delivery operations. The driver sign-in app needed to simulate drivers signing in and out, with variables like arrival times, order preparation durations, and even occasional data entry errors.
Using an LLM, I could generate a data generation script that accounted for realistic customer and brand names, facilitator-specific order identifiers, plausible sign-in and sign-out timestamps, daily order volumes that grow steadily with day-to-day variation, and the occasional data entry error.
This approach significantly reduced the time and effort required to produce a rich, realistic dataset for the demos.
# gpt-4o generated from "make me a script that generates data for a table like ...
from datetime import datetime, timedelta

import numpy as np

# helpers such as random_facilitator(), generate_order_identifier(),
# random_name(), random_brand(), and random_timestamp() omitted for brevity
CUSTOMER_NAMES = [ ... ]
BRANDS = [ ... ]
FACILITATORS = [ ... ]

def generate_daily_orders(date, num_orders, start_id):
    orders = []
    current_id = start_id
    for _ in range(num_orders):
        facilitator = random_facilitator()
        order_id = generate_order_identifier(facilitator)
        customer = random_name()
        brand = random_brand()
        signin_time, signout_time = random_timestamp(date)
        orders.append({ ... })
        current_id += 1
    return orders, current_id

def generate_dataset(start_date, end_date, start_orders, end_orders, variation_percent):
    start_dt = datetime.strptime(start_date, "%Y-%m-%d")
    end_dt = datetime.strptime(end_date, "%Y-%m-%d")
    delta_days = (end_dt - start_dt).days + 1
    dates = [start_dt + timedelta(days=i) for i in range(delta_days)]

    # daily order counts with linear growth, plus random day-to-day variation
    order_counts = np.linspace(start_orders, end_orders, num=delta_days)
    variation_factor = variation_percent / 100
    order_counts = [
        max(1, ... ) for count in order_counts
    ]

    all_orders = []
    current_id = 1
    for date, count in zip(dates, order_counts):
        daily_orders, current_id = generate_daily_orders(date, count, current_id)
        all_orders.extend(daily_orders)
    return all_orders
You may notice that the generated code stub above is suspiciously single-node, and therefore only suitable for generating smaller datasets, but it was enough for this project. It’s always possible to rewrite this for scale if and when needed; a rough sketch of what that could look like follows.
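For the curious, here is a minimal sketch of a scaled-out version, assuming a SparkSession (spark) is available as it is in a Databricks notebook, and reusing dates, order_counts, and generate_daily_orders from above; the target table name is purely illustrative:

# Hypothetical scale-out sketch: precompute each day's starting order ID on the
# driver, then let Spark run the existing per-day generator in parallel.
from pyspark.sql import Row

tasks = []
next_id = 1
for date, count in zip(dates, order_counts):
    tasks.append((date.isoformat(), int(count), next_id))
    next_id += int(count)

orders_rdd = (
    spark.sparkContext
         .parallelize(tasks)
         .flatMap(lambda t: generate_daily_orders(datetime.fromisoformat(t[0]), t[1], t[2])[0])
         .map(lambda order: Row(**order))
)
# illustrative target table name
spark.createDataFrame(orders_rdd).write.mode("overwrite").saveAsTable("demo.driver_signin_history")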
With historical data in place, the next challenge was simulating real-time data flows. A live demo environment needs to reflect ongoing activities, not just past events, in order to feel as real as possible. There were two challenges to overcome here.
One option I explored was an agent-based simulation, where each driver would be an autonomous "agent" generating events like sign-ins and sign-outs in real time. This method offers high realism by closely mimicking real-world interactions. However, it also introduces significant complexity. Each agent needs orchestration and resource management, and broader trends—like business growth, weekly cycles, or seasonal patterns—become especially difficult to coordinate. Since the whole goal of this demo was to improve on the static, stale data that many demo environments rely on, capturing those trends is critical. Without them, the simulation wouldn’t feel dynamic or realistic.
On the implementation side, regardless of the model, the demo needed to be cloud-agnostic. Initially, I considered using AWS-native services like Lambda and EventBridge to trigger agent-like events. While this approach works well for AWS, it would lock the demo into that specific cloud provider, undermining the goal of creating a flexible, portable demo that could run across other platforms like Azure or GCP.
The breakthrough came when I stopped treating historical and real-time data as separate problems and instead modeled a single unified timeline that makes no distinction between them. This approach allowed me to model a continuous flow of events, ensuring that the demo environment remained dynamic and lifelike.
It might seem a bit strange to use Databricks to manage and orchestrate this simulation, especially since we’re essentially doing a round trip of the data—writing from Databricks to an external RDS only to read it back. But if you look past that (after all, this is still a demo, not a real production system), the solution is actually quite elegant. It not only keeps the orchestration of the demo cloud-agnostic, but it also highlights how flexible the Databricks platform is. This round trip showcases the power of Databricks to handle varied workflows while interacting seamlessly with other data systems.
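To make that concrete, the replay step might look roughly like the sketch below. This is a hedged illustration rather than the exact notebook from the project: the table, secret keys, schedule, and column names are all assumptions.

# Hypothetical replay step over the unified timeline: a scheduled Databricks job
# picks up pre-generated events whose timestamps have now "arrived" and appends
# them to the external Postgres (RDS) instance via JDBC, where they re-enter
# Databricks through the normal ingestion path.
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
window_start = now - timedelta(minutes=15)   # assumes the job runs every 15 minutes

due_events = (
    spark.table("rds_catalog.demo.generated_signins")   # pre-generated timeline (illustrative name)
         .where(f"event_time >  '{window_start.isoformat()}' AND "
                f"event_time <= '{now.isoformat()}'")
)

jdbc_url = "jdbc:postgresql://<rds-endpoint>:5432/demo"  # placeholder for the Terraform-provisioned RDS

(due_events.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "driver_signins")
    .option("user", dbutils.secrets.get("rds_scope", "username"))
    .option("password", dbutils.secrets.get("rds_scope", "password"))
    .mode("append")
    .save())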
By implementing this unified timeline approach, I was able to avoid the complexities of managing individual agents while maintaining a highly realistic environment that evolves with time—capturing the broader trends and dynamic behavior that were key to making the demo feel truly real.
I started this project because I was tired of demos that felt disconnected from reality. The technical challenges, while interesting, were just the means to an end. The real goal was to experience our platform the way it actually exists in the wild, not just as an isolated product demo. The hardest problems weren't technical; they were about capturing the messiness and authenticity of real-world usage.
Consolidating infrastructure and authentic data generation in a single package offers a rich and engaging experience. It allows for demos of various Databricks features and capabilities while retaining their connection to the systems that actually produce this data, including:
Looking ahead, there's room to expand and add support for:
Creating a one-click deployment solution for comprehensive demo environments has been a rewarding endeavor. It not only simplifies the process of setting up complex integrations but also elevates the quality and impact of demonstrations. By tackling the challenges of data generation and real-time simulation, we've opened the door to more engaging and realistic demos that can better showcase the full potential of Databricks and cloud infrastructure working in harmony.