Delta log statistics - timestamp type not working

pantelis_mare
Contributor III

Hello team!

As per the documentation, I understand that table statistics (e.g. min, max, count) can be fetched from the delta log, so that the underlying data of a Delta table does not have to be read.

This works for numerical types, and timestamp is supposed to be supported as well.

In my test example though, when it comes to the timestamp column, the query plan shows that the data files are read instead of the delta log.

Any thoughts on how to make Spark read the max value of the timestamp column from the delta log?

Delta table creation:

// Create a small managed Delta table with an integer column and a timestamp column
Seq(
  (1, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now())),
  (2, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now().minusDays(1))),
  (3, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now().minusDays(2)))
)
  .toDF("value", "ts")
  .write
  .saveAsTable("test.test_ts")
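
To compare the two cases I just look at the physical plans; a minimal sketch of what I run (this assumes a Databricks Scala notebook, so spark and the implicits are already in scope):

// Aggregate on the integer column: answered from the Delta log statistics
spark.sql("SELECT max(value) FROM test.test_ts").explain()

// Aggregate on the timestamp column: the plan shows a scan of the data files instead
spark.sql("SELECT max(ts) FROM test.test_ts").explain()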

As expected, the max for the value column is read from the log:

[screenshot: query plan for max(value), answered from the Delta log statistics]

but this is not the case for the ts column:

[screenshot: query plan for max(ts), showing a scan of the data files]

while the delta log contains the needed information (the max of the ts column):

{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"9f6cbb6f-e866-4279-89a3-609bacf175cb","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ts\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1658749213255}}
{"add":{"path":"part-00000-0848b536-6c7b-4429-b168-44d53ec48ba4-c000.snappy.parquet","partitionValues":{},"size":852,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":1,\"ts\":\"2022-07-25T11:40:12.884Z\"},\"maxValues\":{\"value\":1,\"ts\":\"2022-07-25T11:40:12.884Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000000","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00001-0c883bda-bfc3-4921-a46f-32eaa9fb7dbc-c000.snappy.parquet","partitionValues":{},"size":852,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":2,\"ts\":\"2022-07-24T11:40:12.885Z\"},\"maxValues\":{\"value\":2,\"ts\":\"2022-07-24T11:40:12.885Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000001","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00002-54975e9b-3df8-4382-afa0-2929778ecb0a-c000.snappy.parquet","partitionValues":{},"size":853,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":3,\"ts\":\"2022-07-23T11:40:12.885Z\"},\"maxValues\":{\"value\":3,\"ts\":\"2022-07-23T11:40:12.885Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000002","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"commitInfo":{"timestamp":1658749218051,"userId":"6999390541537531","userName":"p.maroudis@kaizengaming.com","operation":"CREATE TABLE AS SELECT","operationParameters":{"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"},"notebook":{"notebookId":"2851472110292823"},"clusterId":"0920-080342-who267","isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"3","numOutputRows":"3","numOutputBytes":"2557"},"engineInfo":"Databricks-Runtime/10.4.x-scala2.12","txnId":"775ac353-9df2-45a2-99d2-1a15783c0e40"}}


6 REPLIES

-werners-
Esteemed Contributor III

Are you sure the timestamp column is a valid Spark timestamp type?

pantelis_mare
Contributor III

Yes, as the schema shows. I also added an explicit cast to TimestampType (even though I create it as a java.sql.Timestamp anyway).

[screenshot: table schema showing ts as a timestamp column]

Do you also get the same behavior, at least with my test case?

-werners-
Esteemed Contributor III

Strange.

Have you checked the actual plan (Spark UI)?

pantelis_mare
Contributor III

Yep, I attached it to my question. Plus it's obvious from the number of tasks it spawns: it reads the whole table (even more obvious on big tables with more partitions to read).

Can you reproduce it on your side as well, or is it just us?

-werners-
Esteemed Contributor III

Same here.

pantelis_mare
Contributor III
(Accepted solution)

After communicating with our CSE at Databricks: apparently there was an issue with truncation and precision loss that led them to deactivate this feature.

I also ran into this on another project when trying to parse timestamp fields with json4s. It uses the SimpleDateFormat class to parse timestamps, which does not support microsecond precision, so maybe this is related.

@Werner Stinckens, thanks again for your help!
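
For what it's worth, a quick sketch of the json4s/SimpleDateFormat pitfall I mean (plain java.text, nothing Delta-specific; the timestamp string is just an example):

import java.text.SimpleDateFormat

// SimpleDateFormat's "S" pattern treats the fractional part as plain milliseconds,
// so a microsecond-precision string is silently shifted by about two minutes here:
val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS")
println(fmt.parse("2022-07-25T11:40:12.123456"))  // 123456 is read as 123456 ms, not 0.123456 s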
