NickKarpov
Databricks Employee

A product demo should have a simple story and be easy to deliver. But when your platform interacts with a dozen different systems, what starts as a simple demo can end up feeling like a complex puzzle. We dedicate significant time to tools like Critical User Journeys (CUJs) to simplify our stories, but some Databricks journeys inevitably push users out of our product and into intricate setups beyond our control. This complexity motivated my recent project: developing a one-click “Model Customer” deployment that not only provisions a Databricks workspace but also sets up all the necessary cloud infrastructure (AWS, for now) and data-generating mock applications to create truly seamless demo environments.

I made a few discoveries and ran into some unexpected challenges along the way. Terraform, for instance, was a pleasant surprise—it worked pretty much out of the box and could be reliably iterated on with some LLM assistance, making it easy to automate infrastructure and handle the complexities of integrating Databricks with AWS resources. On the data side, using LLMs (GPT-4o in particular) for data generation was a major improvement over traditional libraries. I could describe the data and behaviors I needed in plain English, and the LLM produced code that generated datasets capturing trends, seasonality, and even realistic anomalies—without the tedious, field-by-field setup.

The real challenge was how to model both the backfill of historical data and the real-time flow of events to mimic a real-world application. After some brainstorming (and once again a bit of chatting with GPT), I landed on a solution that not only made the modeling and infrastructure simpler but also, as a bonus, showed just how flexible the Databricks platform can be.

From Dust to Data

In most demos and training environments, it's typical to rely on pre-generated, static datasets. While these can show off specific features, they don’t really capture the dynamic nature of real-world applications. Breaking out of this box requires something that feels more like an actual production environment: not just Databricks, but the actual applications and data sources that feed it. The challenge is making all of this easy to deploy—a one-liner in the terminal or a few clicks—so that the demo can run through everything from infrastructure setup on the AWS side, to data processing and analytics setup in Databricks.

I now declare you, infrastructure and automated.

While writing scripts to orchestrate this style of deployment with the AWS and Databricks SDKs seemed feasible, it quickly became clear that it wouldn’t meet the broader goal of cross-platform reusability, and managing and reasoning about imperative script-based setups would get messy fast. Instead, a declarative infrastructure approach with Terraform was the right choice. It provides a clean and easily extensible way to automate the provisioning of environments, ensuring that tedious integrations between Databricks and cloud infrastructure services across AWS, Azure, GCP, etc. are consistently configured.

I was pleasantly surprised by how mature the Databricks Terraform provider is. I could define nearly all of Databricks’ resources, including some of the very latest features still in preview status. The provider covers a wide range of resources available on Databricks, including compute, code (notebooks), storage, security, and even machine learning.

I defined resources for:

  • AWS Infrastructure: Provisioning the RDS (PostgreSQL) database that backs the mock application.
  • Networking: Managing VPCs, subnets, and security groups to ensure secure communication between services.
  • Databricks Workspace: Configuring notebooks, jobs, secrets, and permissions.

Using Terraform modules, combined with version control, meant that every infrastructure change could be tracked like any other code. While I’ve only focused on AWS so far, Terraform’s modular approach has laid a strong foundation for expanding to other clouds like Azure and GCP in the future. 

# networking.tf, security_groups.tf, providers.tf, rds.tf
resource "aws_vpc" "isolated_vpc" { ... }
resource "aws_security_group" "rds_sg" { ... }
resource "aws_db_instance" "postgresql" { ... }

# databricks_resources.tf
resource "databricks_secret_scope" "rds_scope" { ... }
resource "databricks_catalog" "rds_catalog" { ... }
resource "databricks_notebook" "ddl" { ... }
resource "databricks_job" "realtime_data_gen" { ... }

I should still note that while Terraform is incredibly powerful, it won’t solve all the infrastructure problems. In the future we’ll still need a solution for provisioning various external vendors, which may not have Terraform providers, or for which a provider may not even make sense. And then there are always gaps and nuances to discover and work around: for example, while I can easily define a Databricks Workflow to run a notebook, I can’t actually trigger its execution without an API call.
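
Closing that last gap only takes a small post-deployment step outside of Terraform. Here is a minimal sketch, assuming the databricks-sdk Python package is installed and the job ID has been captured from the Terraform output (the ID below is purely illustrative):

# Minimal sketch: trigger the Terraform-defined Workflow once after deployment
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()          # picks up credentials from env vars or ~/.databrickscfg
w.jobs.run_now(job_id=123456)  # illustrative ID; in practice read from `terraform output`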

Data DiLLMa

With the infrastructure automated, the next hurdle was data generation. Realistic data is crucial for meaningful demos, but generating it is notoriously tedious. I needed data that not only looked real but also behaved realistically over time, incorporating trends, seasonal patterns, and anomalies. Traditional data generation libraries require you to define fields, data types, distributions, and so on, which is just too much work. This is where I found LLMs really shine.

LLMs excel at understanding and generating human-like text based on context. They often do so even when the human's (my) prompts are not fully thought out or described, which is particularly useful when sketching the outlines of a real-world dataset.

For example, the demo I was working on was a driver sign-in application for a ghost kitchen platform—a service that provides commercial kitchen spaces for both small restaurateurs and large chains to expand their pickup and delivery operations. The driver sign-in app needed to simulate drivers signing in and out, with variables like arrival times, order preparation durations, and even occasional data entry errors.

Using an LLM, I could produce a data generation script that accounted for:

  • Variable Sign-In and Sign-Out Times: Reflecting different preparation times for various cuisines.
  • Trends and Seasonality: Incorporating daily, weekly, and seasonal patterns.
  • Data Anomalies: Simulating real-world data inconsistencies, such as missing sign-out times.
  • Repeat Customers: Introducing repeat drivers with controlled frequency.

This approach significantly reduced the time and effort required to produce a rich, realistic dataset for the demos.

# gpt-4o generated from "make me a script that generates data for a table like ...
from datetime import datetime, timedelta

import numpy as np

CUSTOMER_NAMES = [ ... ]
BRANDS = [ ... ]
FACILITATORS = [ ... ]

def generate_daily_orders(date, num_orders, start_id):
    orders = []
    for _ in range(num_orders):
        facilitator = random_facilitator()
        order_id = generate_order_identifier(facilitator)
        customer = random_name()
        brand = random_brand()
        signin_time, signout_time = random_timestamp(date)
        orders.append({ ... })
    # return the day's orders and the next available id for the caller
    return orders, start_id + num_orders

def generate_dataset(start_date, end_date, start_orders, end_orders, variation_percent):
    start_dt = datetime.strptime(start_date, "%Y-%m-%d")
    end_dt = datetime.strptime(end_date, "%Y-%m-%d")
    delta_days = (end_dt - start_dt).days + 1
    dates = [start_dt + timedelta(days=i) for i in range(delta_days)]

    # daily order counts with linear growth, plus random day-to-day variation
    order_counts = np.linspace(start_orders, end_orders, num=delta_days)
    variation_factor = variation_percent / 100

    order_counts = [
        max(1, ... ) for count in order_counts
    ]

    all_orders = []
    current_id = 1

    for date, count in zip(dates, order_counts):
        daily_orders, current_id = generate_daily_orders(date, int(count), current_id)
        all_orders.extend(daily_orders)

    return all_orders

You may notice that the generated code stub above is suspiciously single-node, and therefore only suitable for generating smaller datasets, but it was enough for this project. It’s always possible to rewrite this for scale, if and when needed; one rough way to do that is sketched below.
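
The sketch below (my addition, not code from the original project) shows how the same generator could be distributed with Spark if larger volumes were ever needed: the per-day work is parallelized across the cluster, and each task reuses the single-node daily generator from the stub above. It assumes the dates and order_counts lists have been computed the same way as in generate_dataset.

# Rough scale-out sketch: distribute per-day generation across a Spark cluster
from pyspark.sql import Row

def daily_orders_as_rows(date_and_count):
    date, count = date_and_count
    # id handling is simplified here; order ids come from generate_order_identifier
    orders, _ = generate_daily_orders(date, int(count), 0)
    return [Row(**order) for order in orders]

orders_df = (
    spark.sparkContext
         .parallelize(list(zip(dates, order_counts)))  # same inputs generate_dataset builds
         .flatMap(daily_orders_as_rows)
         .toDF()
)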

Real-Time Data, Real-Time Problem

With historical data in place, the next challenge was simulating real-time data flows. A live demo environment needs to reflect ongoing activities, not just past events, in order to feel as real as possible. There were two challenges to overcome here.

How to Model the Simulation

One option I explored was using an agent-based simulation, where each driver would be an autonomous "agent" generating events like sign-ins and sign-outs in real time. This method offers high realism by mimicking real-world interactions directly. However, it also introduces significant complexity. Each agent needs orchestration and resource management, and broader trends—like business growth, weekly cycles, or seasonal patterns—become especially difficult to coordinate. Since the whole goal of this demo was to improve on the static, stale data that many demo environments rely on, capturing those trends is critical. Without them, the simulation wouldn’t feel dynamic or realistic.

How to Implement the Model

On the implementation side, regardless of the model, the demo needed to be cloud-agnostic. Initially, I considered using AWS-native services like Lambda and EventBridge to trigger agent-like events. While this approach works well for AWS, it would lock the demo into that specific cloud provider, undermining the goal of creating a flexible, portable demo that could run across other platforms like Azure or GCP.

The Unified Timeline Approach

The breakthrough came when I shifted focus from treating historical and real-time data as separate problems to a unified timeline that made no distinction. This approach allowed me to model a continuous flow of events, ensuring that the demo environment remained dynamic and lifelike. 

[Image: unified timeline diagram (NickKarpov_1-1730234035062.png)]

  • Data Generation Window: I generated data for a period starting one year in the past and extending one month into the future. This gave the environment a fully populated historical base while allowing for real-time events to seamlessly take over.
  • Bulk Insert of Past Events: Historical events were bulk-inserted into the RDS database, giving the demo a realistic starting point.
  • Scheduled Processing of Future Events: Future events were stored in a Databricks table, with a scheduled job (via Databricks Workflows) that checked at regular intervals for events due to be triggered. As events reached their scheduled time, they were written to the RDS instance, simulating real-time activity (a minimal sketch of this job follows below).
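
To make that scheduled-processing step concrete, here is a minimal sketch of what the Workflows-triggered notebook might look like. The table name, columns (event_time, emitted), JDBC URL, and secret scope are illustrative assumptions rather than the exact names used in the project.

# Minimal sketch of the scheduled "future events" job (names are illustrative)
from datetime import datetime, timezone

cutoff = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Future events are assumed to live in a Delta table with an event_time column
# and an emitted flag marking rows already promoted to the application database
due_events = spark.sql(f"""
    SELECT * FROM demo.future_events
    WHERE event_time <= '{cutoff}' AND NOT emitted
""")

# Hypothetical JDBC endpoint for the Terraform-provisioned PostgreSQL RDS instance
jdbc_url = "jdbc:postgresql://<rds-endpoint>:5432/demo"

(due_events
    .drop("emitted")
    .write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "driver_signins")
    .option("user", dbutils.secrets.get("rds_scope", "username"))
    .option("password", dbutils.secrets.get("rds_scope", "password"))
    .mode("append")
    .save())

# Mark promoted rows so the next scheduled run doesn't replay them
spark.sql(f"UPDATE demo.future_events SET emitted = true WHERE event_time <= '{cutoff}' AND NOT emitted")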

It might seem a bit strange to use Databricks to manage and orchestrate this simulation, especially since we’re essentially doing a round trip of the data—writing from Databricks to an external RDS only to read it back. But if you look past that (after all, this is still a demo, not a real production system), the solution is actually quite elegant. It not only keeps the orchestration of the demo cloud-agnostic, but it also highlights how flexible the Databricks platform is. This round trip showcases the power of Databricks to handle various workflows while still being able to interact seamlessly with other data systems.

Overall, this unified timeline approach provided:

  • Simplified Infrastructure: By using Databricks for scheduling and event processing, I avoided reliance on additional cloud services, keeping the solution portable and straightforward across different platforms.
  • Dynamic Trends: Combining past and future events on a single timeline allowed me to introduce long-term trends, daily cycles, and seasonal fluctuations—essential for making the environment as real as possible. The result is a demo environment that responds to realistic, evolving conditions, improving on the typical static data used in most training setups.
  • Ease of Extension: Adjusting the data generation window is easy, which means the demo can stay dynamic and relevant for as long as needed.

By implementing this unified timeline approach, I was able to avoid the complexities of managing individual agents while maintaining a highly realistic environment that evolves with time—capturing the broader trends and dynamic behavior that were key to making the demo feel truly real.

Reflections and Future Directions

I started this project because I was tired of demos that felt disconnected from reality. The technical challenges, while interesting, were just the means to an end. The real goal was to experience our platform the way it actually exists in the wild, not just as an isolated product demo. The hardest problems weren't technical; they were about capturing the messiness and authenticity of real-world usage.

Consolidating infrastructure and authentic data generation in a single package offers a rich and engaging experience. It allows for demos of various Databricks features and capabilities while retaining their connection to the systems that actually produce this data, including:

  • Real-Time Analytics: Showcasing how Databricks can handle streaming data from external systems.
  • Lakehouse Architecture: Illustrating the unification of data warehousing and data lakes in a single platform.
  • Advanced Analytics and Machine Learning: Applying predictive models to the simulated data.

Looking ahead, there's room to expand and add support for:

  • Cloud compatibility: Azure and GCP
  • External vendor support: Vendors that produce data
  • Governance features: SCIM sync, etc.
  • Data complexity: More data-generating applications
  • Open-sourcing the framework

Creating a one-click deployment solution for comprehensive demo environments has been a rewarding endeavor. It not only simplifies the process of setting up complex integrations but also elevates the quality and impact of demonstrations. By tackling the challenges of data generation and real-time simulation, we've opened the door to more engaging and realistic demos that can better showcase the full potential of Databricks and cloud infrastructure working in harmony.