Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Design Question

Frank
New Contributor III

We have an application that takes in raw metrics data as key-value pairs.

We then split the data into four different tables, each with a schema like the one below:

`key1, min, max, average`

Those four tables are later used for dashboards.

  1. What are the design recommendations for this? Should we change the schema?
  2. When data is ingested, there seems to be a ~2s delay for every SQL command. In the ingestion endpoint we have to write to the four tables and also insert into the raw table, which adds up to about 2s × 5 = 10s, which is really long. How can we minimize the ingest time? (A simplified sketch of this path is shown after this list.)
  3. What is the recommended data ingestion pattern? We currently use HTTP POST to a server, and the server then writes to the database. But Delta seems to be slow in this case.
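Roughly, the ingestion endpoint currently does something like this (a simplified sketch; the table names are placeholders, not our exact schema, and `spark` is the Databricks notebook session):

```python
# Simplified sketch of the current ingestion path (table names are
# placeholders). Each spark.sql() call is its own Delta commit, so the
# ~2s per-command latency multiplies across all five writes.
def ingest(key, value, spark):
    # one commit for the raw table...
    spark.sql(f"INSERT INTO raw_metrics VALUES ('{key}', {value})")
    # ...plus one commit per derived table: ~2s x 5 commits = ~10s per event
    for t in ["metrics_1", "metrics_2", "metrics_3", "metrics_4"]:
        spark.sql(f"INSERT INTO {t} VALUES ('{key}', {value})")
```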


stefnhuy
New Contributor III

Hey,

I can totally relate to the challenges Frank is facing with this application's data processing. It's frustrating to deal with delays, especially when working with real-time metrics. I've had a similar experience where optimizing data ingestion was crucial.

Considering the design, using separate tables for 'min', 'max', and 'average' is a reasonable start for dashboard efficiency. However, the 2-second delay per SQL command seems like the bottleneck. Have you thought about batch processing instead of individual inserts? Combining multiple commands into one batch could significantly reduce the overhead.
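For instance, instead of issuing one INSERT per event, you could buffer events and flush them as a single Delta append. This is just a minimal sketch under assumed table and column names (`raw_metrics` with `key`, `value`), not your exact schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined in a Databricks notebook

buffer = []  # accumulate incoming (key, value) events between flushes

def flush(buffer):
    """Write all buffered events as one append, paying the per-commit
    latency once per batch instead of once per row."""
    if not buffer:
        return
    df = spark.createDataFrame(buffer, ["key", "value"])
    df.write.format("delta").mode("append").saveAsTable("raw_metrics")
    buffer.clear()

buffer.append(("key1", 42.0))
buffer.append(("key1", 17.5))
flush(buffer)  # one commit for the whole batch
```

The same idea applies to the derived tables: one append per table per batch means 5 commits per batch, not 5 per event.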

Regarding the ingestion pattern, HTTP POST to a server is convenient, but if Delta's slow, exploring other technologies like Apache Kafka might be worth it. It's designed for high-throughput, real-time data streaming.
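If you go that route, a Structured Streaming job can move micro-batches from Kafka into Delta continuously. Here's a sketch, assuming events arrive as JSON `{"key": ..., "value": ...}` on a topic named `metrics` (the broker address, topic, and checkpoint path are all placeholders):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

event_schema = (StructType()
                .add("key", StringType())
                .add("value", DoubleType()))

# Read the raw Kafka stream; Kafka delivers the payload in a binary `value` column.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
       .option("subscribe", "metrics")                    # placeholder topic
       .load())

# Parse the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", event_schema).alias("e"))
          .select("e.*"))

# Continuously append micro-batches to the Delta table.
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/checkpoints/metrics")  # placeholder path
         .toTable("raw_metrics"))
```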

Changing the schema might help, but first, analyze the read vs. write frequency. If reads are more frequent, consider optimizing the dashboard queries.
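For example, if the four derived tables exist only to serve the dashboard, a single aggregate over the raw table might replace all of them (again a sketch with placeholder names):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Compute min/max/average per key in one pass over the raw table,
# instead of maintaining four separate derived tables on every ingest.
summary = (spark.table("raw_metrics")
           .groupBy("key")
           .agg(F.min("value").alias("min"),
                F.max("value").alias("max"),
                F.avg("value").alias("average")))

summary.write.format("delta").mode("overwrite").saveAsTable("metrics_summary")
```

You could run that on a schedule and point the dashboard at `metrics_summary`, which shifts the aggregation cost off the ingest path.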

Remember, it's a trial-and-error process. I'd love to hear how others dealt with similar challenges and what worked best for them.
