How do we manage data recency in Databricks?
06-21-2021 05:57 AM
06-22-2021 05:43 PM
When you use Delta tables in Databricks, you get the benefit of the Delta cache, which accelerates data reads by keeping copies of remote files in the nodes’ local storage in a fast intermediate data format. At the beginning of each query, a Delta table automatically updates to its latest version, so the results always reflect the most recent data.
However, if it is acceptable for results to be slightly stale for a short time, you can lower query latency further by setting the Spark session configuration spark.databricks.delta.stalenessLimit to a time string value, e.g. 1h, 15m, or 1d. With this set, a query can use the table state it already has instead of waiting for the update, as long as that state is no older than the limit.
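As a rough illustration, the setting can be applied from a notebook with spark.conf.set. The sketch below wraps that call in a small helper; the helper name set_staleness_limit and its format check are hypothetical and not part of any Databricks API. The check only accepts the minute, hour, and day forms mentioned above, which is a choice made for this sketch rather than a statement of everything Databricks accepts.

```python
import re

# Configuration key named in the answer above.
STALENESS_KEY = "spark.databricks.delta.stalenessLimit"

# Assumption for this sketch: accept only an integer followed by m, h, or d,
# matching the examples in the post (15m, 1h, 1d).
_TIME_STRING = re.compile(r"^\d+[mhd]$")


def set_staleness_limit(spark, limit):
    """Set the Delta staleness limit on the given Spark session.

    `spark` is any object exposing `conf.set(key, value)`, such as the
    SparkSession that Databricks notebooks provide as `spark`.
    """
    if not _TIME_STRING.match(limit):
        raise ValueError(f"unexpected time string: {limit!r}")
    spark.conf.set(STALENESS_KEY, limit)
    return limit


# In a Databricks notebook, where `spark` is already defined:
# set_staleness_limit(spark, "1h")
```

Without the helper, the same effect should come from calling spark.conf.set("spark.databricks.delta.stalenessLimit", "1h") directly. Since this is a session-level setting, it would only affect queries run in that Spark session.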