Saturday
Hi,
I come from a traditional ETL background and am having trouble understanding some of the cloud hyperscaler features and use cases.
I understand Databricks is hosted on the cloud providers. I see the cloud providers have their own tools for ETL, ML/AI, etc. So what is the advantage or use case for using Databricks over the cloud providers' tools?
Please explain like you would to a 5yr old :). As I said I am totally new to this space.
-Benedict
Sunday
Hey Benedict!
That’s actually a great question and one that a lot of people have when they come from a traditional ETL background.
Before diving in, can I ask which cloud you’re using? (AWS, Azure, or GCP?) — because each one has its own native tools (like AWS Glue, Azure Data Factory, or Google Dataflow), and the best way to explain Databricks is by comparing it to the specific tools you already know.
But let’s make it really simple for now:
Sure, you can do ETL in cloud-native tools like Glue, Data Factory, or Dataflow… but:
With Databricks, your ETL lives in notebooks — so you can combine SQL, Python, or Spark seamlessly. You can prototype, test, and productionize in the same environment.
Plus, because Databricks runs on top of Apache Spark, it can handle massive amounts of data efficiently. It's the same engine that several of those cloud ETL tools (Glue, for example) use under the hood, but in Databricks you have full control over it.
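To make that concrete, here is a minimal sketch of what a notebook ETL step can look like. The storage path and table name (/mnt/raw/orders, sales.orders_clean) are made up for the example, Delta is available out of the box on Databricks, and in a Databricks notebook the `spark` session is already created for you:

```python
# Minimal notebook-style ETL sketch in PySpark.
# The path and table name below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in a Databricks notebook

# Extract: read raw JSON files landed in cloud storage
raw = spark.read.json("/mnt/raw/orders")

# Transform: plain DataFrame code, same API whether the data is 1 GB or 10 TB
clean = (
    raw.filter(F.col("status") == "COMPLETED")
       .withColumn("order_date", F.to_date("order_ts"))
)

# Load: write a Delta table that SQL, Python, and BI tools can all see
clean.write.format("delta").mode("overwrite").saveAsTable("sales.orders_clean")

# ...and you can switch to SQL in the same notebook when that's easier:
spark.sql("SELECT order_date, COUNT(*) AS orders FROM sales.orders_clean GROUP BY order_date").show()
```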
Cloud providers have their own ML platforms (SageMaker, Azure ML, Vertex AI), but Databricks adds something very powerful: MLflow, which is built in. That means you can track experiments, compare runs, register and version models, and manage deployments, all from the same workspace.
It’s like having the entire MLOps lifecycle inside one platform — unified, reproducible, and auditable.
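As a rough illustration (this is just the open-source mlflow API with a toy scikit-learn model, nothing Databricks-specific), experiment tracking looks like this:

```python
# Sketch of MLflow experiment tracking; the model and dataset are just toys.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf_baseline"):
    n_estimators = 100
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)

    mae = mean_absolute_error(y_test, model.predict(X_test))

    # Parameters, metrics, and the model itself all land in the tracking UI,
    # so every run is reproducible and comparable later.
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(model, "model")
```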
Databricks also includes Databricks SQL, which lets you query your data lake directly as if it were a data warehouse. You can run ad hoc SQL queries, build dashboards, and connect BI tools like Power BI or Tableau straight to your lake.
In a way, Databricks bridges the gap between data lakes and data warehouses — that’s why it’s called a Lakehouse.
You can think of it as turning your raw cloud storage into a full data & AI platform.
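For example (reusing the hypothetical sales.orders_clean table from the ETL sketch above), a warehouse-style query over the lake is just SQL. In Databricks SQL you would run the same statement from the SQL editor or a dashboard, and a BI tool connected to a SQL warehouse can issue it directly:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Warehouse-style aggregation straight over the Delta table in the lake.
monthly = spark.sql("""
    SELECT date_trunc('month', order_date) AS month,
           COUNT(*)                        AS orders
    FROM sales.orders_clean
    GROUP BY date_trunc('month', order_date)
    ORDER BY month
""")
monthly.show()
```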
Hope that helps clear things up.
Gema 👩💻
Sunday
Thanks a lot Gema. The first example made it a lot clearer.
I just enrolled for the October Learning Festival and I am trying out DBx in our company's sandbox on Azure.
Again, in a traditional warehouse I spend on infra up front and then there's no running cost for memory or compute. But with cloud tools, the developers and users have to be constantly aware of this. Is this not an unnecessary overhead? Shouldn't this worry be the Infra team's alone? Why involve the devs and users in keeping an eye on infra costs?
Monday
Thanks for your follow-up! That’s a really good and fair question — especially for folks coming from traditional warehouses or on-prem environments.
In most projects I’ve worked on, it’s true that the platform or infra team is responsible for cost control, monitoring usage patterns, and setting alerts if a specific team or job suddenly spikes in consumption.
However, as a developer, DE or ML engineer, I personally still try to stay aware of my own resource usage, even if I don’t always know the exact €/$ cost of a given job or training run. Why? Because in cloud environments:
Everything is elastic and on-demand, which is great for flexibility, but it also means every compute second and every GB of memory has a price tag attached.
So my own rule of thumb is:
Always start with the smallest cluster possible, and only scale up if the code/job really needs it.
I avoid using the largest clusters "just in case" — it’s better to monitor memory usage, optimize logic, and grow gradually if needed.
Databricks also uses its own cost unit called DBUs (Databricks Units), which makes it easier to track costs in a normalized way across workloads — but keep in mind that the actual machines are running on your cloud provider (Azure, AWS, GCP). Also, serverless compute in Databricks can be great for certain use cases, but it’s not always cheaper than classic clusters — for example, if your job runs for a long time or needs tight control over the environment, classic compute may be more cost-effective.
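To make the "start small" rule concrete, this is roughly what I mean when sizing an interactive cluster. The field names follow the Databricks Clusters API, but the runtime version and the Azure node type are only examples, so check what your workspace actually offers:

```python
# Sketch of a deliberately small, autoscaling interactive cluster definition.
# Runtime version and node type are illustrative; they vary by cloud and workspace.
small_cluster_spec = {
    "cluster_name": "benedict-sandbox",      # hypothetical name
    "spark_version": "14.3.x-scala2.12",     # example Databricks Runtime
    "node_type_id": "Standard_DS3_v2",       # small Azure VM type
    "autoscale": {
        "min_workers": 1,                    # start as small as possible...
        "max_workers": 4,                    # ...and cap how far it can grow
    },
    "autotermination_minutes": 20,           # stop paying when the cluster sits idle
}
```

You could send this to the Clusters API or mirror the same settings in the cluster creation UI; the point is the shape: a small node type, a low minimum, a hard cap, and auto-termination.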
So yes, ideally the infra team should own cost governance. But from my side, I try to start small, keep an eye on my own usage, and scale up only when a job really needs it.
Cloud is a shift in mindset — it gives you power and flexibility, but also means everyone plays a small part in cost efficiency, especially at scale.
Gema 👩💻
Monday
Thanks a lot, Gema, for the detailed and meticulous answers.
I guess I have to unlearn and relearn everything starting today.