topic Re: Streaming Reads Full Table with Liquid Clustering in Data Engineering

Streaming Reads Full Table with Liquid Clustering

bampo — Sat, 25 May 2024 08:25:09 GMT

Each merge/update on a table with liquid clustering force the streaming to read whole table.
Databricks Runtime: 14.3 LTS

Below I prepare a simple scripts to reproduce the issue:

Create schema.

%sql CREATE SCHEMA IF NOT EXISTS test;

Create table with simple structure.
Timestamp of a value, id, value, year/month/day (for clustering).
Table is clustered by year+month+day and pointId.
deletion vectors enabled.

%sql CREATE TABLE IF NOT EXISTS test.test_values ( tsUtc TIMESTAMP NOT NULL, pointId STRING NOT NULL, value DOUBLE NOT NULL, pYear INT NOT NULL, pMonth INT NOT NULL, pDay INT NOT NULL, tsInserted TIMESTAMP NOT NULL ) USING delta CLUSTER BY (pYear, pMonth, pDay, pointId) TBLPROPERTIES ('delta.enableDeletionVectors' = true) ;

Start a stream to read test.test_values table and print the count of retrieved records.

%python from pyspark.sql import DataFrame _table_name = "test.test_values" _source_df = spark.readStream.format("delta").option("ignoreChanges", "true").table(_table_name) def upsert_batch(df: DataFrame, batch_id: int): print(f"Number of records: {df.count()}") _query_name = "test_query_1" _processing_time = "30 seconds" sink = (_source_df .writeStream .foreachBatch(upsert_batch) .trigger(processingTime=_processing_time) .queryName(_query_name) .start()) sink.awaitTermination()

Insert test data for 10 different point, with 1000 records of randomly generated timestamp and value.

WITH seq AS ( SELECT explode(sequence(1, 1000)) AS id ), points AS ( SELECT explode(array( 'point1', 'point2', 'point3', 'point4', 'point5', 'point6', 'point7', 'point8', 'point9', 'point10') ) AS pointId ), cross_joined AS ( SELECT p.pointId, s.id FROM points p CROSS JOIN seq s ), randomized AS ( SELECT pointId, current_timestamp() AS tsUtc, rand() as value, '2024' AS pYear, '5' AS pMonth, '24' AS pDay, current_timestamp() AS tsInserted FROM cross_joined ) INSERT INTO test.test_values(tsUtc, pointId, value, pYear, pMonth, pDay, tsInserted) SELECT tsUtc, pointId, value, pYear, pMonth, pDay, tsInserted FROM randomized;

After that stream reads 10k records, which is correct.
Number of records: 10000

After "identified" an incorrect data, and update should be performed for all values for pointId=point1 and whole day of the 24th of May.

%sql update test.test_values set tsInserted=now(), value = value + 1 where pointId="point1" and pyear=2024 and pmonth=5 and pday=24;

The expected number of records retrieved by stream is 1000, the exact number of updated records (each point have 1k records per day).
But in fact the retrieved numebr of records is 10000;
Number of records: 10000

Performing the same update for same point produces correct result:
Number of records: 1000

But updating data for different pointId again produces not expected result.

%sql update test.test_values set tsInserted=now(), value = value + 1 where pointId="point5" and pyear=2024 and pmonth=5 and pday=24;

Number of records: 9000

Recreating the table with the same structure, but partitioned (PARTITIONED BY (pYear, pMonth, pDay, pointId)) instead of clustered produces correct expected results.

What could be the reason of such behavior?

Re: Streaming Reads Full Table with Liquid Clustering

daniel_sahal — Mon, 27 May 2024 07:37:54 GMT

@bampo
Partitioning is splitting the data in multiple parquet files, that's why it's reading the table partially.
Liquid clustering behaves differently. With that small amount of data, it's keeping all of the data in the single file, because it's the most efficient way of handling a small data volume.

This video explains what liquid clustering is about - https://www.youtube.com/watch?v=5t6wX28JC_M

Re: Streaming Reads Full Table with Liquid Clustering

bampo — Mon, 27 May 2024 08:18:42 GMT

Thanks for the reply!

But it doesn't explain why after the first update whole table is returned (10k records), and after second update (of the same data) only updated records are returned (1k records).

Also worth to mention, that in my example I use small amount of data only for showcase reason. On our test environment the table consists of more than 30 millions of records, and after updating 200 of them, stream reads the whole table (30m+ records) which is unacceptable.

Re: Streaming Reads Full Table with Liquid Clustering

radothede — Mon, 27 May 2024 08:46:22 GMT

It seems You are not using checkpoint location, is that intended?

https://docs.databricks.com/en/structured-streaming/query-recovery.html

That might be the reason your streaming query is reading the whole table every time You trigger the process.

Re: Streaming Reads Full Table with Liquid Clustering

bampo — Mon, 27 May 2024 08:58:16 GMT

Thanks for the reply,

In the example checkpoint is not used for simplification, but in real scenario it is used for sure.

Described scenario reproduces with a single trigger of a stream (without re-triggering the process), so checkpoint is not an issue in this case.