- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-25-2022 05:20 AM
Hello team!
As per the documentation, I understand that the table statistics can be fetched through the delta log (eg min, max, count) in order to not read the underlying data of a delta table.
This is the case for numerical types, and timestamp is supposed to be supported.
In my test example though, when it comes to the timestamp column, the data are fetched as shown by the query plan, instead of reading the delta_log.
Any thoughts on how to make spark read the max values of the timestamp through the delta log?
Delta table creation:
Seq(
(1, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now())),
(2, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now().minusDays(1))),
(3, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now().minusDays(2)))
)
.toDF("value", "ts")
.write
.saveAsTable("test.test_ts")As expected, max for column value is read by the log:
but this is not the case for the ts column:
while the delta log contains the needed information (max of ts column):
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"9f6cbb6f-e866-4279-89a3-609bacf175cb","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ts\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1658749213255}}
{"add":{"path":"part-00000-0848b536-6c7b-4429-b168-44d53ec48ba4-c000.snappy.parquet","partitionValues":{},"size":852,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":1,\"ts\":\"2022-07-25T11:40:12.884Z\"},\"maxValues\":{\"value\":1,\"ts\":\"2022-07-25T11:40:12.884Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000000","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00001-0c883bda-bfc3-4921-a46f-32eaa9fb7dbc-c000.snappy.parquet","partitionValues":{},"size":852,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":2,\"ts\":\"2022-07-24T11:40:12.885Z\"},\"maxValues\":{\"value\":2,\"ts\":\"2022-07-24T11:40:12.885Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000001","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00002-54975e9b-3df8-4382-afa0-2929778ecb0a-c000.snappy.parquet","partitionValues":{},"size":853,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":3,\"ts\":\"2022-07-23T11:40:12.885Z\"},\"maxValues\":{\"value\":3,\"ts\":\"2022-07-23T11:40:12.885Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000002","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"commitInfo":{"timestamp":1658749218051,"userId":"6999390541537531","userName":"p.maroudis@kaizengaming.com","operation":"CREATE TABLE AS SELECT","operationParameters":{"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"},"notebook":{"notebookId":"2851472110292823"},"clusterId":"0920-080342-who267","isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"3","numOutputRows":"3","numOutputBytes":"2557"},"engineInfo":"Databricks-Runtime/10.4.x-scala2.12","txnId":"775ac353-9df2-45a2-99d2-1a15783c0e40"}}