
Databricks and DDD

Dunken
New Contributor III

Our architecture follows Domain-Driven Design (DDD). The data is therefore distributed across different domains.

We would like to run workloads on top of our data, but we want to avoid having a dedicated (duplicated) data lake just for Databricks. Instead, we would rather rely directly on our own data sources (accessible via REST APIs) so that we always run on the same, latest data.

Could anybody point me to some resources to get started? It would definitely be fine to have an abstraction layer between what we use in a notebook and what our backend APIs look like...

1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

So basically you do not want to persist data outside of your source systems.

I think the so-called 'Kappa architecture' could be a fit, where everything is treated as a stream.

Hubert already mentioned Kafka, which is an excellent source to build this on (there are others too). And on top of that you could use Spark, Flink, or whatever.

There are also Apache NiFi, StreamSets, and others.

Kappa architecture is pretty cool, but not without its flaws.

There is also the fairly recent 'data mesh', where providing data is treated as a domain responsibility. This could be a match for your use case.

But this approach of course also has its flaws (governance and significant overhead, for example).


5 REPLIES

Hubert-Dudek
Esteemed Contributor III

You can just use urlopen or requests and then read the JSON as a DataFrame using spark.read.json(). The problem is that in that case you will need to handle the whole logic yourself (when to load data, how to handle incremental loads, etc.).
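A minimal sketch of that approach in a Databricks notebook (where `spark` is the predefined SparkSession); the endpoint URL is just a placeholder for one of your domain APIs, and authentication is omitted:

```python
import json
import requests

# Placeholder endpoint for one of your domain APIs; add auth headers as needed
url = "https://orders.example.com/api/v1/orders"

response = requests.get(url, timeout=30)
response.raise_for_status()

# One JSON string per record so Spark can infer the schema
json_lines = [json.dumps(record) for record in response.json()]
df = spark.read.json(spark.sparkContext.parallelize(json_lines))

df.printSchema()
df.show(5)
```

All the load logic (scheduling, incremental loads, retries) still lives in your notebook, which is the drawback mentioned above.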

An easier solution is to use streaming: put the data from your APIs into Kafka (Confluent can also be provisioned through Azure) or any other stream such as Event Hubs. Then your newest data can be read as a Kafka stream in Databricks, and the processed data can be saved to a destination of your choice. On the side of your infrastructure you could just deploy a microservice which reads from the REST APIs and writes to the stream.
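As a rough sketch, reading such a topic in Databricks with Structured Streaming and persisting it to a Delta table could look like this (the broker address, topic name, event schema, and table name are assumptions, not your actual setup):

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# Placeholder brokers and topic -- adjust to your Confluent / Event Hubs setup
kafka_options = {
    "kafka.bootstrap.servers": "broker1:9092",
    "subscribe": "orders-domain-events",
    "startingOffsets": "latest",
}

# Assumed schema of the events your domain microservice publishes
event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

stream = (
    spark.readStream.format("kafka")
    .options(**kafka_options)
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("event"))
    .select("event.*")
)

# Persist to a destination of your choice, e.g. a Delta table
(
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .toTable("orders_latest")
)
```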

Dunken
New Contributor III

Thanks. If I used streaming, I would be replicating all my data sources, wouldn't I? This is actually something I would like to avoid... Also, because I don't know up front which data I'm interested in, I would have to store everything in Databricks.

-werners-
Esteemed Contributor III

If you really want to avoid replicating data (which means reporting directly on your source systems), you can look into Presto, Trino, Dremio, etc.
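For example, if one of the domains were exposed through a Trino cluster, Spark could query it directly over JDBC without copying the data. A rough sketch, assuming the Trino JDBC driver is installed on the cluster; the coordinator URL, catalog, table, and user are placeholders:

```python
# Placeholder Trino coordinator, catalog, table, and user
jdbc_url = "jdbc:trino://trino.example.com:8443/orders_catalog"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "io.trino.jdbc.TrinoDriver")  # driver JAR must be on the cluster
    .option("dbtable", "public.orders")
    .option("user", "databricks_reader")
    .load()
)

df.show(5)
```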

Kaniz
Community Manager

Hi @Armin Galliker, did @Werner Stinckens's reply answer your question?

If yes, would you like to mark his answer as the best?

