Streaming delta table - Performance with incremental refresh

Fnazar — Wed, 31 Jan 2024 11:15:45 GMT

Hi Team,

We are hitting performance issues with streaming live Delta tables, specifically when evaluating large tables of more than 10 million rows.
What workarounds are available for loading such large tables with streaming live tables?
Also, if we can use partition by, please help me with the syntax.

Thanks

Re: Streaming delta table - Performance with incremental refresh

Priyanka_Biswas — Thu, 01 Feb 2024 01:24:09 GMT

Hi @Fnazar

When dealing with streaming data, you might end up with many small files, which can be inefficient. Use Delta Lake's OPTIMIZE command to compact files into larger ones and ZORDER to colocate related information in the same set of files. This is particularly useful for columns that are often queried together.
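
As a rough illustration of the compaction step, the sketch below builds an OPTIMIZE statement with an optional ZORDER BY clause and runs it through an existing Spark session. The helper names, the table name, and the column names are made up for the example; only the OPTIMIZE ... ZORDER BY (...) syntax itself comes from Delta Lake.

```python
def build_optimize_sql(table_name, zorder_cols=()):
    """Return an OPTIMIZE statement, with ZORDER BY if columns are given.

    Note: Z-order columns must not be partition columns of the table.
    """
    sql = f"OPTIMIZE {table_name}"
    if zorder_cols:
        sql += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return sql


def compact_table(spark, table_name, zorder_cols=()):
    """Run the statement on a SparkSession that has Delta Lake enabled."""
    return spark.sql(build_optimize_sql(table_name, zorder_cols))


# Hypothetical table and columns, shown only to illustrate the statement.
print(build_optimize_sql("sales.events", ["customer_id", "event_type"]))
```

In a notebook you would call compact_table(spark, "sales.events", ["customer_id"]) on a schedule that suits your write volume, rather than after every micro-batch.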

For partitioning, select a column that results in evenly distributed data. Common choices include dates (for time-based data) or a categorical column whose values are well balanced.

When creating or writing to a Delta table, you can specify the partitioning with partitionBy in the DataFrame API (or with a PARTITIONED BY clause in SQL). For instance, if you're partitioning by a date column:
df.write.format("delta").partitionBy("date_column").save("/mnt/delta/my_table")

This command creates one partition in the Delta table for each unique value in date_column.
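
Since the question is about streaming loads, here is a minimal sketch of the same partitioning applied to a streaming write. The function name, table name, and paths are placeholders, and it assumes Spark 3.3 or later (for the availableNow trigger) with Delta Lake available.

```python
def start_partitioned_stream(spark, source_table, target_path,
                             checkpoint_path, partition_col):
    """Stream rows from source_table into a partitioned Delta table."""
    return (
        spark.readStream.table(source_table)      # incremental read of the source
        .writeStream.format("delta")
        .partitionBy(partition_col)               # one partition per distinct value
        .option("checkpointLocation", checkpoint_path)  # tracks processed data
        .outputMode("append")
        .trigger(availableNow=True)               # process the backlog, then stop
        .start(target_path)
    )


# Example call with hypothetical names:
# query = start_partitioned_stream(
#     spark, "raw_events", "/mnt/delta/my_table",
#     "/mnt/delta/_checkpoints/my_table", "date_column")
# query.awaitTermination()
```

In a Delta Live Tables Python pipeline, the @dlt.table decorator accepts a partition_cols argument (for example partition_cols=["date_column"]) for the same purpose, so you would set the partitioning there instead of on a writer.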

If you're ingesting streaming data into Delta Lake, consider using Auto Loader for efficient and incremental processing of new data.
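
A minimal sketch of an Auto Loader read follows, assuming a Databricks runtime (the cloudFiles source is not part of open-source Spark). The function name, the paths, and the JSON default are placeholders for illustration.

```python
def read_with_auto_loader(spark, source_path, schema_path, file_format="json"):
    """Return a streaming DataFrame that picks up new files incrementally."""
    return (
        spark.readStream.format("cloudFiles")                  # Auto Loader source
        .option("cloudFiles.format", file_format)              # format of incoming files
        .option("cloudFiles.schemaLocation", schema_path)      # where the inferred schema is stored
        .load(source_path)
    )


# Example call with hypothetical paths:
# df = read_with_auto_loader(spark, "/mnt/landing/events", "/mnt/landing/_schemas/events")
```

The returned DataFrame can then be written to a Delta table with writeStream, using a checkpoint location so that only new files are processed on each run.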

https://docs.delta.io/latest/best-practices.html

https://docs.databricks.com/en/sql/language-manual/delta-optimize.html
