07-25-2022 05:20 AM
Hello team!
As per the documentation, I understand that table statistics (e.g. min, max, count) can be fetched from the Delta log, so that the underlying data of a Delta table does not have to be read.
This works for numerical types, and timestamps are supposed to be supported as well.
In my test example, however, the query plan shows that for the timestamp column the data files are read instead of the Delta log.
Any thoughts on how to make Spark read the max value of the timestamp column from the Delta log?
Delta table creation:
// Needed for toDF when running outside a notebook/shell where it is not auto-imported
import spark.implicits._

Seq(
  (1, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now())),
  (2, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now().minusDays(1))),
  (3, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now().minusDays(2)))
)
  .toDF("value", "ts")
  .write
  .saveAsTable("test.test_ts") // Delta is the default table format on Databricks
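For anyone trying to reproduce: a minimal sketch of the kind of MAX query whose plan I inspected (the exact query in my screenshots may differ slightly). If the optimization applies, the physical plan typically shows a LocalTableScan/LocalRelation instead of a scan of the Parquet files.

// Sketch: check whether the aggregate is answered from the Delta log or from the data files
spark.sql("SELECT MAX(value) FROM test.test_ts").explain()
spark.sql("SELECT MAX(ts) FROM test.test_ts").explain()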
As expected, the max of the value column is read from the log:
but this is not the case for the ts column:
while the Delta log contains the needed information (the max of the ts column):
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"9f6cbb6f-e866-4279-89a3-609bacf175cb","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ts\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1658749213255}}
{"add":{"path":"part-00000-0848b536-6c7b-4429-b168-44d53ec48ba4-c000.snappy.parquet","partitionValues":{},"size":852,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":1,\"ts\":\"2022-07-25T11:40:12.884Z\"},\"maxValues\":{\"value\":1,\"ts\":\"2022-07-25T11:40:12.884Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000000","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00001-0c883bda-bfc3-4921-a46f-32eaa9fb7dbc-c000.snappy.parquet","partitionValues":{},"size":852,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":2,\"ts\":\"2022-07-24T11:40:12.885Z\"},\"maxValues\":{\"value\":2,\"ts\":\"2022-07-24T11:40:12.885Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000001","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00002-54975e9b-3df8-4382-afa0-2929778ecb0a-c000.snappy.parquet","partitionValues":{},"size":853,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":3,\"ts\":\"2022-07-23T11:40:12.885Z\"},\"maxValues\":{\"value\":3,\"ts\":\"2022-07-23T11:40:12.885Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000002","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"commitInfo":{"timestamp":1658749218051,"userId":"6999390541537531","userName":"p.maroudis@kaizengaming.com","operation":"CREATE TABLE AS SELECT","operationParameters":{"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"},"notebook":{"notebookId":"2851472110292823"},"clusterId":"0920-080342-who267","isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"3","numOutputRows":"3","numOutputBytes":"2557"},"engineInfo":"Databricks-Runtime/10.4.x-scala2.12","txnId":"775ac353-9df2-45a2-99d2-1a15783c0e40"}}
07-26-2022 12:32 AM
Are you sure the timestamp column is a valid Spark timestamp type?
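For example, something like this would show it (just a quick sketch):

// Quick check of the column types
spark.table("test.test_ts").printSchema()
// Based on the schemaString in the log above, this should print:
// root
//  |-- value: integer (nullable = true)
//  |-- ts: timestamp (nullable = true)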
07-26-2022 02:39 AM
07-26-2022 02:47 AM
Strange.
Have you checked the actual plan in the Spark UI?
07-26-2022 02:54 AM
Yep, I attached it in my question. Plus it's obvious from the number of tasks it spawns: it reads the whole table (even more obvious on big tables with more partitions to read).
Can you reproduce it on your side as well or is it just us?
07-26-2022 03:04 AM
same here
07-26-2022 03:09 AM
After communicating with our CSE at Databricks: apparently there was an issue with truncation and precision loss that made them deactivate this feature for timestamp columns.
I actually hit the same kind of issue on another project when parsing timestamp fields with json4s, which relies on the SimpleDateFormat class; that class does not support microsecond precision, so maybe this is related.
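For illustration only (a sketch of the precision limitation, not the actual json4s or Databricks code):

import java.text.SimpleDateFormat
import java.time.Instant

// A hypothetical microsecond-precision timestamp, like the values stored in the data files
val micros = "2022-07-25T11:40:12.884123Z"

// java.time keeps the sub-millisecond digits
println(Instant.parse(micros)) // 2022-07-25T11:40:12.884123Z

// SimpleDateFormat's "S" pattern means milliseconds, so the six fractional digits
// are read as 884123 ms and the timestamp silently shifts by roughly 14.7 minutes
val sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z'")
sdf.setTimeZone(java.util.TimeZone.getTimeZone("UTC"))
println(sdf.parse(micros).toInstant) // 2022-07-25T11:54:56.123Z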
@Werner Stinckens thanks again for your help!