
Databricks and DDD

Dunken
New Contributor III

Our architecture follows Domain-Driven Design (DDD). The data is therefore distributed across different domains.

We would like to run workloads on top of our data, but we want to avoid having a dedicated (duplicated) data lake just for Databricks. Instead, we would rather rely directly on our own data sources (accessible via REST APIs) so that we always run on the same, latest data.

Could anybody point me to some resources to get started? It would definitely be fine to have an abstraction layer between what we use in a notebook and what our backend APIs look like...

1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

So basically you do not want to persist data outside of your source systems.

I think the so-called 'Kappa architecture' could be a fit, where everything is treated as a stream.

Hubert already mentioned Kafka, which is an excellent source to build this on (there are others too). And on top of that you could use Spark, Flink, or whatever.

There are also Apache NiFi, StreamSets, and others.

Kappa architecture is pretty cool, but not without its flaws.

There is also the fairly recent 'data mesh', where providing data is treated as a domain responsibility. This could be a match for your use case.

But this approach of course also has its flaws (governance and significant overhead, for example).


5 REPLIES

Hubert-Dudek
Esteemed Contributor III

You can just use urlopen or requests and then read the JSON as a DataFrame using spark.read.json(). The problem is that in that case you will need to handle the whole logic yourself (when to load data, how to handle incremental loads, etc.).
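A minimal sketch of that approach in a Databricks notebook (where `spark` is the predefined SparkSession); the endpoint URL is just a placeholder for one of your domain APIs, and authentication is omitted:

```python
import json
import requests

# Placeholder endpoint for one of your domain APIs; add auth headers as needed
url = "https://orders.example.com/api/v1/orders"

response = requests.get(url, timeout=30)
response.raise_for_status()

# One JSON string per record so Spark can infer the schema
json_lines = [json.dumps(record) for record in response.json()]
df = spark.read.json(spark.sparkContext.parallelize(json_lines))

df.printSchema()
df.show(5)
```

All the load logic (scheduling, incremental loads, retries) still lives in your notebook, which is the drawback mentioned above.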

An easier solution is to use streaming: put the data from your APIs into Kafka (Confluent can also be provisioned through Azure) or any other stream such as Event Hubs. Then your newest data can be read as a Kafka stream in Databricks, and the processed data can be saved to a destination of your choice. On the side of your infrastructure you could just deploy a microservice which reads from the REST APIs and writes to the stream.
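As a rough sketch, reading such a topic in Databricks with Structured Streaming and persisting it to a Delta table could look like this (the broker address, topic name, event schema, and table name are assumptions, not your actual setup):

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# Placeholder brokers and topic -- adjust to your Confluent / Event Hubs setup
kafka_options = {
    "kafka.bootstrap.servers": "broker1:9092",
    "subscribe": "orders-domain-events",
    "startingOffsets": "latest",
}

# Assumed schema of the events your domain microservice publishes
event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

stream = (
    spark.readStream.format("kafka")
    .options(**kafka_options)
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("event"))
    .select("event.*")
)

# Persist to a destination of your choice, e.g. a Delta table
(
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .toTable("orders_latest")
)
```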

Dunken
New Contributor III

Thanks. If I used streaming, I would be replicating all my data sources, wouldn't I? This is actually something I would like to avoid... Also, because I don't know up front which data I'm interested in, I would have to store everything in Databricks.

-werners-
Esteemed Contributor III

If you really want to avoid replicating data (which means reporting directly on your source systems), you can look into Presto, Trino, Dremio, etc.
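For example, if one of the domains were exposed through a Trino cluster, Spark could query it directly over JDBC without copying the data. A rough sketch, assuming the Trino JDBC driver is installed on the cluster; the coordinator URL, catalog, table, and user are placeholders:

```python
# Placeholder Trino coordinator, catalog, table, and user
jdbc_url = "jdbc:trino://trino.example.com:8443/orders_catalog"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "io.trino.jdbc.TrinoDriver")  # driver JAR must be on the cluster
    .option("dbtable", "public.orders")
    .option("user", "databricks_reader")
    .load()
)

df.show(5)
```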

Kaniz
Community Manager

Hi @Armin Galliker, did @Werner Stinckens's reply answer your question?

If yes, would you like to mark his answer as the best?

