Saturday
Hi,
I come from a traditional ETL background and am having trouble understanding some of the cloud hyperscaler features and use cases.
I understand Databricks is hosted on the cloud providers. I see the cloud providers have their own tools for ETL, ML/AI, etc. So what is the advantage or use case for using Databricks over the cloud providers' tools?
Please explain like you would to a 5yr old :). As I said I am totally new to this space.
-Benedict
Sunday
Hey Benedict!
That’s actually a great question and one that a lot of people have when they come from a traditional ETL background.
Before diving in, can I ask which cloud you’re using? (AWS, Azure, or GCP?) — because each one has its own native tools (like AWS Glue, Azure Data Factory, or Google Dataflow), and the best way to explain Databricks is by comparing it to the specific tools you already know.
But let’s make it really simple for now:
Sure, you can do ETL in cloud-native tools like Glue, Data Factory, or Dataflow… but:
With Databricks, your ETL lives in notebooks — so you can combine SQL, Python, or Spark seamlessly. You can prototype, test, and productionize in the same environment.
Plus, because Databricks runs on top of Apache Spark, it can handle massive amounts of data efficiently. It's the same engine that several of those cloud ETL tools (Glue, for example) use under the hood, but in Databricks you have full control over it.
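To make that concrete, here is a minimal sketch of what a notebook ETL step can look like. The storage path and table name (/mnt/raw/orders, sales.orders_clean) are made up for the example, Delta is available out of the box on Databricks, and in a Databricks notebook the `spark` session is already created for you:

```python
# Minimal notebook-style ETL sketch in PySpark.
# The path and table name below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in a Databricks notebook

# Extract: read raw JSON files landed in cloud storage
raw = spark.read.json("/mnt/raw/orders")

# Transform: plain DataFrame code, same API whether the data is 1 GB or 10 TB
clean = (
    raw.filter(F.col("status") == "COMPLETED")
       .withColumn("order_date", F.to_date("order_ts"))
)

# Load: write a Delta table that SQL, Python, and BI tools can all see
clean.write.format("delta").mode("overwrite").saveAsTable("sales.orders_clean")

# ...and you can switch to SQL in the same notebook when that's easier:
spark.sql("SELECT order_date, COUNT(*) AS orders FROM sales.orders_clean GROUP BY order_date").show()
```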
Cloud providers have their own ML platforms (SageMaker, Azure ML, Vertex AI), but Databricks adds something very powerful: MLflow, which is built in. That means you can track experiments, compare runs, register and version models, and manage deployments, all from the same workspace.
It’s like having the entire MLOps lifecycle inside one platform — unified, reproducible, and auditable.
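As a rough illustration (this is just the open-source mlflow API with a toy scikit-learn model, nothing Databricks-specific), experiment tracking looks like this:

```python
# Sketch of MLflow experiment tracking; the model and dataset are just toys.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf_baseline"):
    n_estimators = 100
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)

    mae = mean_absolute_error(y_test, model.predict(X_test))

    # Parameters, metrics, and the model itself all land in the tracking UI,
    # so every run is reproducible and comparable later.
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(model, "model")
```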
Databricks also includes Databricks SQL, which lets you query your data lake directly as if it were a data warehouse. You can run ad hoc SQL queries, build dashboards, and connect BI tools like Power BI or Tableau straight to your lake.
In a way, Databricks bridges the gap between data lakes and data warehouses — that’s why it’s called a Lakehouse.
You can think of it as turning your raw cloud storage into a full data & AI platform.
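For example (reusing the hypothetical sales.orders_clean table from the ETL sketch above), a warehouse-style query over the lake is just SQL. In Databricks SQL you would run the same statement from the SQL editor or a dashboard, and a BI tool connected to a SQL warehouse can issue it directly:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Warehouse-style aggregation straight over the Delta table in the lake.
monthly = spark.sql("""
    SELECT date_trunc('month', order_date) AS month,
           COUNT(*)                        AS orders
    FROM sales.orders_clean
    GROUP BY date_trunc('month', order_date)
    ORDER BY month
""")
monthly.show()
```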
Hope that helps clear things up.
Gema 👩💻
Sunday
Thanks a lot Gema. The first example made it a lot clearer.
I just enrolled for the October Learning Festival and I am trying out DBx in our company's sandbox on Azure.
Again, in a traditional warehouse I spend on infra up front and then there's no running cost for memory or compute. But with cloud tools, the developers and users have to be constantly aware of this. Is this not an unnecessary overhead? Shouldn't this worry be the Infra team's alone? Why involve the devs and users in keeping an eye on infra costs?
Monday
Thanks for your follow-up! That’s a really good and fair question — especially for folks coming from traditional warehouses or on-prem environments.
In most projects I’ve worked on, it’s true that the platform or infra team is responsible for cost control, monitoring usage patterns, and setting alerts if a specific team or job suddenly spikes in consumption.
However, as a developer, DE or ML engineer, I personally still try to stay aware of my own resource usage, even if I don’t always know the exact €/$ cost of a given job or training run. Why? Because in cloud environments:
Everything is elastic and on-demand, which is great for flexibility, but it also means every compute second and every GB of memory has a price tag attached.
So my own rule of thumb is:
Always start with the smallest cluster possible, and only scale up if the code/job really needs it.
I avoid using the largest clusters "just in case" — it’s better to monitor memory usage, optimize logic, and grow gradually if needed.
Databricks also uses its own cost unit called DBUs (Databricks Units), which makes it easier to track costs in a normalized way across workloads — but keep in mind that the actual machines are running on your cloud provider (Azure, AWS, GCP). Also, serverless compute in Databricks can be great for certain use cases, but it’s not always cheaper than classic clusters — for example, if your job runs for a long time or needs tight control over the environment, classic compute may be more cost-effective.
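To make the "start small" rule concrete, this is roughly what I mean when sizing an interactive cluster. The field names follow the Databricks Clusters API, but the runtime version and the Azure node type are only examples, so check what your workspace actually offers:

```python
# Sketch of a deliberately small, autoscaling interactive cluster definition.
# Runtime version and node type are illustrative; they vary by cloud and workspace.
small_cluster_spec = {
    "cluster_name": "benedict-sandbox",      # hypothetical name
    "spark_version": "14.3.x-scala2.12",     # example Databricks Runtime
    "node_type_id": "Standard_DS3_v2",       # small Azure VM type
    "autoscale": {
        "min_workers": 1,                    # start as small as possible...
        "max_workers": 4,                    # ...and cap how far it can grow
    },
    "autotermination_minutes": 20,           # stop paying when the cluster sits idle
}
```

You could send this to the Clusters API or mirror the same settings in the cluster creation UI; the point is the shape: a small node type, a low minimum, a hard cap, and auto-termination.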
So yes, ideally the infra team should own cost governance. But from my side, I try to start small, keep an eye on my own usage, and scale up only when a job really needs it.
Cloud is a shift in mindset — it gives you power and flexibility, but also means everyone plays a small part in cost efficiency, especially at scale.
Gema 👩💻
Monday
Thanks a lot, Gema, for the detailed and meticulous answers.
I guess I have to unlearn and relearn everything starting today.