Hello,
Our Spark jobs stream messages from Event Hub, transform them, and finally persist them to storage. We plan to try out different cluster configurations for these jobs in order to find the optimal one and then procure Azure reservations. Furthermore, it is important to optimize the operation of these jobs so that we make balanced use of the DBUs.
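For context, here is a simplified sketch of what each job roughly does (not our actual code; names and paths are placeholders, and in this sketch I read Event Hubs through its Kafka-compatible endpoint):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Stream messages from Event Hubs (Kafka-compatible endpoint, secrets elided).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "<event-hub-name>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", "<sasl jaas config with the Event Hubs connection string>")
    .load()
)

# Light transformation: parse the payload and stamp the ingestion time.
transformed = raw.select(
    F.col("value").cast("string").alias("body"),
    F.current_timestamp().alias("ingested_at"),
)

# Persist to storage as Delta, with a checkpoint for fault tolerance.
query = (
    transformed.writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://<container>@<account>.dfs.core.windows.net/checkpoints/job1")
    .start("abfss://<container>@<account>.dfs.core.windows.net/tables/job1")
)
```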
1/ Do we need a Multi-Node or a Single Node configuration?
2/ What VM type is recommended for such workloads? Do we need memory-optimized, compute-optimized, or general-purpose (balanced) VMs?
3/ If Multi-Node is recommended, what VM type should we use for the Driver and what for the Workers?
4/ Does it make sense to enable AutoScaling?
5/ Does it make sense to enable Photon?
6/ Shall we run the job continuously or trigger it frequently? (See the sketch below for the two modes I mean.)
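To make question 6 concrete, these are the two trigger styles I am comparing (reusing the `transformed` DataFrame from the sketch above; paths are placeholders):

```python
checkpoint_path = "abfss://<container>@<account>.dfs.core.windows.net/checkpoints/job1"
output_path = "abfss://<container>@<account>.dfs.core.windows.net/tables/job1"

# (a) Run continuously: the cluster stays up 24/7 and processes micro-batches
#     as soon as events arrive.
always_on = (
    transformed.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(processingTime="1 minute")
    .start(output_path)
)

# (b) Trigger frequently: availableNow drains whatever has accumulated and then
#     stops, so a jobs cluster can start, catch up, and terminate on a schedule
#     (e.g. every 15 minutes). Requires Spark 3.3+ / DBR 10.4+; older runtimes
#     would use trigger(once=True) instead.
scheduled = (
    transformed.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .start(output_path)
)
```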
thanks