Administration & Architecture
How to move from embedded Spark inside Spring Boot microservices to Azure Databricks?

Mett
New Contributor II

Hi,

My question is more about the architecture solutions and their potential implementation.

The project:

Multiple microservices, each deployed on a Tomcat instance, and each Tomcat runs on one of the three VMs existing in Azure.

Each microservice is a Spring Boot application exposing a couple of endpoints that can be consumed by other microservices or a final client (browser); some of these microservices contain an embedded Spark.

The file storage is a folder on one of the VMs' hard disks; this folder is mounted on the other two servers.

[Image: infra diagram]

Below is a workflow example for more clarity:

From their browser, a user can click on a function (Filter / Aggregate...) that starts executing a function in one of the services (service A). This function runs a Spark SQL query on a file located in the file storage (the folder on the hard disk), and Spark generates a new file that is returned to the client as the result.

**This service can also handle other requests that don't need Spark.

[Image: execJob diagram]

The objective is to use Spark in Azure Databricks instead of the embedded Spark we are using now, while keeping the services deployed on Tomcat.

For the file system, I'm thinking of moving all the file storage to Azure File Share.

[Image: target architecture diagram]

The most unclear point for me is how to use Spark from a Spring Boot service while the cluster is on Databricks, knowing that Spark jobs will be triggered by an end user from their browser at any time. Do I only need to change the Spark connection config to point to the Databricks server:

conf.setMaster("spark://databricks.server.url:port");

Or should I edit the code to submit the job (rewrite the Spark part in a separate class with a main method and submit it)?
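To make the second option concrete, this is the kind of launcher I imagine in the Spring Boot service: the Spark logic moves to a separate class with a main method, packaged in its own jar, and the service invokes spark-submit as an external process. Class names, paths, and the master URL here are all hypothetical placeholders — whether Databricks accepts a direct master URL is exactly my open question.

```java
import java.util.List;

public class SparkJobLauncher {

    /** Builds the spark-submit command for the standalone job class. */
    static List<String> buildCommand(String mainClass, String jarPath, String inputFile) {
        return List.of(
                "spark-submit",
                "--class", mainClass,            // the separate class with the main method
                jarPath,                          // jar containing the refactored Spark code
                inputFile);                       // passed through to the job as args[0]
    }

    /** Launches the job as an external process and waits for it to finish. */
    static int launch(String inputFile) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                // com.example.FilterJob and the jar path are hypothetical placeholders
                buildCommand("com.example.FilterJob", "/opt/jobs/filter-job.jar", inputFile));
        pb.inheritIO(); // stream Spark's stdout/stderr into the service logs
        return pb.start().waitFor();
    }
}
```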

Please feel free to offer any suggestions / recommendations, and thank you for your help and time.

If any information is missing, just let me know and I will add it.

2 REPLIES

-werners-
Esteemed Contributor III

My main question is why you use Spark in the first place, but that is another discussion.

If you want to use Databricks, you will have to launch a Spark job on the Databricks cluster one way or another, either with spark-submit or by using the Jobs API.
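The Jobs API route can be called straight from the Spring Boot service with Java's built-in HTTP client. A minimal sketch against the Jobs API 2.1 `run-now` endpoint — the workspace URL, token, job ID, and parameter names are placeholders you would replace with your own:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DatabricksJobTrigger {
    // Placeholders: your Azure Databricks workspace URL and a personal access token.
    static final String WORKSPACE = "https://adb-1234567890.0.azuredatabricks.net";
    static final String TOKEN = "dapi...";

    /** Builds the JSON body for the Jobs API 2.1 run-now endpoint. */
    static String buildRunNowPayload(long jobId, String inputPath) {
        // "input_path" is a hypothetical parameter name defined by your job.
        return "{\"job_id\": " + jobId + ","
             + " \"notebook_params\": {\"input_path\": \"" + inputPath + "\"}}";
    }

    /** Triggers a run of an existing job; returns the raw JSON response (contains run_id). */
    static String triggerRun(long jobId, String inputPath) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(WORKSPACE + "/api/2.1/jobs/run-now"))
                .header("Authorization", "Bearer " + TOKEN)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildRunNowPayload(jobId, inputPath)))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Since the user is waiting in the browser, the service would then have to poll the run state (or use `runs/get`) before returning the result file.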

However, if you want to use this with your microservices, I guess you need an always-on Databricks cluster, which will become expensive (unless you can wait several minutes for a cluster to spin up).

Or perhaps check out a streaming scenario?

Mett
New Contributor II

Thank you for your answer.

We use Spark because the application is like a customized ETL: the user loads a file and can apply as many transformations as they want (for each transformation there's a microservice in the background that executes it as SQL in Spark). The POC at the beginning of the project showed that Spark performed better than a plain Java application.

As far as I know, autoscaling isn't possible with spark-submit; is it the same for Azure Databricks Jobs? If autoscaling is available through the Jobs API, could starting a cluster with a minimal configuration and relying on autoscaling be a correct solution?
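Something like this is what I have in mind for the job cluster spec: the Jobs API lets you attach an `autoscale` block (min/max workers) to a `new_cluster` in a `runs/submit` request, so each run gets an ephemeral cluster sized on demand. The runtime version, node type, and main class below are placeholders:

```java
public class AutoscaleJobSpec {

    /** Builds a Jobs API 2.1 runs/submit body with an autoscaling ephemeral job cluster. */
    static String buildSubmitPayload(int minWorkers, int maxWorkers) {
        return "{"
             + "\"run_name\": \"adhoc-transformation\","
             + "\"new_cluster\": {"
             +   "\"spark_version\": \"13.3.x-scala2.12\","  // placeholder runtime version
             +   "\"node_type_id\": \"Standard_DS3_v2\","    // placeholder Azure node type
             +   "\"autoscale\": {"
             +     "\"min_workers\": " + minWorkers + ","
             +     "\"max_workers\": " + maxWorkers
             +   "}"
             + "},"
             // com.example.FilterJob is the same hypothetical standalone job class
             + "\"spark_jar_task\": {\"main_class_name\": \"com.example.FilterJob\"}"
             + "}";
    }
}
```

The trade-off would be cluster spin-up latency per run versus the cost of keeping an always-on cluster, as mentioned above.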

I'm still analysing the streaming scenario...
