Exploring the Use of Databricks as a Transactional Database
08-13-2024 02:21 AM
Hey everyone, I'm currently working on a project where my team is thinking about using Databricks as a transactional database for our backend application. We're familiar with Databricks for analytics and big data processing, but we're not sure if it's the right fit for handling real-time transactional workloads. Has anyone in the community successfully used Databricks for this purpose? Is it a good idea, or would it be better to stick with traditional transactional databases? If you have any experience, success stories, or advice, I'd really appreciate hearing about it. Looking forward to your insights! Best,
08-13-2024 02:38 AM
Hi @amoralca,
Databricks is mainly used for big data processing. In my opinion it's not the best choice for an OLTP database: you spin up all those cluster nodes, but your workload is transactional in nature, so you're wasting most of that compute power.
Additionally, the lakehouse is heavily dependent on big data file formats like Parquet, Delta Lake, ORC, Iceberg, etc. These are essentially immutable. In an OLTP system you have to do a lot of small synchronous updates, which is cumbersome in a lakehouse.
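To make the immutability point concrete, here's a minimal sketch (hypothetical table name, run in a Databricks notebook where `spark` is already defined). A single-row update is executed as a Spark job that rewrites whole Parquet data files and creates a new Delta commit, which is why many small synchronous updates get expensive:
```python
# Minimal sketch (PySpark in a Databricks notebook, where `spark` is the
# predefined SparkSession). The table name `main.app.orders` is hypothetical.

# A single-row, OLTP-style update...
spark.sql("""
    UPDATE main.app.orders
    SET status = 'SHIPPED'
    WHERE order_id = 42
""")

# ...runs as a Spark job: Delta locates every data file containing a matching
# row, rewrites those Parquet files entirely, and commits a new table version.
# One commit (and a set of rewritten files) per update.
spark.sql("DESCRIBE HISTORY main.app.orders LIMIT 5").show(truncate=False)
```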
But this is an interesting question and I'd like to hear more voices on this topic.
08-14-2024 12:52 AM
Hi @amoralca, Thanks for reaching out! Please review the responses and let us know which best addresses your question. Your feedback is valuable to us and the community. If the response resolves your issue, kindly mark it as the accepted solution. This will help close the thread and assist others with similar queries. We appreciate your participation and are here if you need further assistance!
08-14-2024 09:02 PM
My 2 cents: the Databricks lakehouse is like a DWH, similar to an Azure Synapse dedicated pool, and is meant for a certain purpose. With all that power comes a limitation in concurrency and in the number of queries that can run in parallel. So it's great if you are loading large volumes of data or running analytical queries. But if you are going to have hundreds to thousands of small queries and single-row inserts, I don't see it as a good fit; those queries and inserts won't really benefit from Spark at all. Normal SQL DBs come with comparatively lower storage limits but offer good concurrency for small queries and inserts. Technically, though, you can still use a Databricks lakehouse as an OLTP DB.
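To illustrate the point about small queries and inserts, here's a rough sketch (hypothetical table, Databricks notebook where `spark` is predefined). Each single-row INSERT is its own Spark job and Delta commit, so per-statement latency is much higher than in a typical OLTP database, and the table fills up with small files that later need compaction:
```python
import time

# `spark` is the SparkSession that Databricks notebooks provide by default.
# The table name is hypothetical.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.app.orders (
        order_id BIGINT, status STRING, amount DOUBLE
    ) USING DELTA
""")

# Simulate OLTP-style traffic: one row per statement.
start = time.time()
for i in range(100):
    spark.sql(f"INSERT INTO main.app.orders VALUES ({i}, 'NEW', 9.99)")
elapsed = time.time() - start

# Each INSERT is a separate Spark job and Delta commit, typically adding one
# new small Parquet file, so the table soon needs OPTIMIZE / auto-compaction.
print(f"{elapsed / 100:.2f} s per single-row insert on average")
```
In a traditional OLTP database the same loop would be simple row-level writes with millisecond latency and no file compaction to worry about.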
12-12-2024 11:23 AM
I have a similar situation in my data quality check process. During this stage, I frequently find errors or potential issues that can stop the pipeline. Each of these errors requires manual intervention, which might involve making edits or supplying justifications for the discrepancies. Once all the issues are resolved, the pipeline can resume its operation without any problems.
I was considering two options:
- Using Databricks.
- Pushing the data to AWS DynamoDB and getting the response back to continue the process (rough sketch below).
What are your thoughts on these options?
Context: it is a multi-tenant process serving many clients.
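For option 2, here's a rough sketch of what I have in mind with boto3 (the table name, key schema, and attributes are placeholders for illustration): failed checks are written to a DynamoDB table keyed by tenant, and the pipeline polls until a reviewer marks the item as resolved.
```python
# Rough sketch of option 2. The table name "dq_exceptions", its key schema
# (partition key "tenant_id", sort key "check_id"), and the attributes are
# assumptions for illustration only.
import time
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("dq_exceptions")

def raise_exception(tenant_id: str, check_id: str, details: str) -> None:
    """Record a failed data-quality check that needs manual review."""
    table.put_item(
        Item={
            "tenant_id": tenant_id,
            "check_id": check_id,
            "details": details,
            "status": "PENDING",
        }
    )

def wait_until_resolved(tenant_id: str, check_id: str, poll_seconds: int = 60) -> dict:
    """Block the pipeline until a reviewer marks the item RESOLVED."""
    while True:
        item = table.get_item(
            Key={"tenant_id": tenant_id, "check_id": check_id}
        ).get("Item", {})
        if item.get("status") == "RESOLVED":
            return item  # carries the reviewer's edits or justification
        time.sleep(poll_seconds)
```
A Databricks-only alternative could be a Delta "quarantine" table that reviewers edit, since these manual interventions are low-volume rather than high-concurrency OLTP traffic.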

